618 Project - Cosmetics Product Analysis Part 2¶

Title:

Uncovering Market Trends and Key Drivers of Customer Satisfaction: A Data Analysis of Sephora Products and Reviews

Team Members:

  • Ruizhe Wang (Unique Name: wrzwrz)
  • Zhiyi Ji (Unique Name: jizhiyi)

Part 1¶

Overview¶

This project focuses on analyzing the Sephora Products and Skincare Reviews Dataset, which contains detailed information on over 8,000 beauty products and 1 million customer reviews. The dataset provides key insights into the cosmetic industry through product attributes, prices, ingredients, and customer feedback.

Our primary objective is to identify factors that drive higher product ratings and understand how ingredients, prices, and product types influence customer satisfaction. Additionally, we aim to explore why certain products underperform by analyzing user reviews. Beyond ratings, we will investigate trends in consumer sentiment over time and assess the impact of movements like "clean beauty" and seasonal variations in feedback.

Using data manipulation, sentiment analysis, and visualization, we intend to uncover patterns in customer behavior, empowering brands to align their strategies with customer needs and industry trends.

Motivation¶

The cosmetic and skincare industry is evolving rapidly, with consumers becoming more focused on tailored solutions and sustainable products. Customer reviews are now essential in shaping brand reputations and guiding purchase decisions. This project is motivated by a desire to understand what drives customer satisfaction, allowing businesses to improve product offerings based on consumer needs.

Specifically, we aim to address the following 3 real-world questions:

  1. What factors drive higher product ratings? Why do certain products underperform?

    • What we hope to learn: Identify which product attributes (e.g., ingredients, price) are linked to higher satisfaction and uncover common issues affecting low-rated products. We also want to generate methods to resolve the issues.
  2. How does customer sentiment change over time?

    • What we hope to learn: Analyze shifts in customer feedback and assess the impact of trends (e.g. "clean beauty") to better understand evolving preferences.
  3. Are certain ingredients linked to positive or negative customer experiences?

    • What we hope to learn: Detect patterns in ingredient preferences and identify sources of customer dissatisfaction.
  4. How certain ingredients link to effectiveness for customers with different skin types?

    • What we hope to learn: Identify patterns in ingredient effectiveness across various skin types and determine which ingredients are most beneficial or detrimental for specific skin types.

Through this analysis, we hope to generate actionable insights that help brands meet customer expectations, avoid problematic ingredients, and enhance customer experiences.

Data Sources¶

Our project utilizes two primary datasets, which complement each other by linking product attributes with customer feedback, providing the comprehensive information for our analysis.

This data was collected in March 2023 and provides detailed information on a variety of beauty and skincare products available on Sephora's online store. It includes both product details and customer reviews, offering a comprehensive view of customer feedback and product features. The dataset is publicly available on Kaggle, a popular data platform, making it suitable for exploratory analysis and machine learning tasks.

URL:

  • Kaggle: Sephora Products and Skincare Reviews Dataset

Description:

  1. Product Information Dataset (product_info.csv)

    • Contains detailed data on 8,000+ products from various brands, covering the following attributes:
      • Product name and brand
      • Price
      • Ingredients and product type (e.g., moisturizer, serum)
      • Product features (e.g., organic, vegan, chemical-free)
    • Purpose: This dataset allows us to analyze product pricing, ingredients, average ratings and categorize product types.
  2. Customer Reviews Dataset

    • Split across multiple files: review_0-250.csv, review_250-500.csv, review_500-750.csv, review_750-1250.csv, review_1250-end.csv.
    • Contains 1 million+ reviews on over 2,000 skincare products, with key data fields:
      • Ratings: Numerical scores reflecting customer satisfaction
      • Review text: Customer opinions, experiences, and feedback
      • Review metadata: User details (e.g., skin type) and peer ratings on the helpfulness of reviews
      • Date: Submitting date of the reviews.
    • Purpose: This dataset helps us explore customer feedbacks and sentiment and detect market trends.
How the Datasets Complement Each Other¶

The product information dataset provides insight into product attributes such as ingredients, pricing and etc., while the customer reviews dataset offers feedback on how these products perform from the customer’s perspective. Together, these datasets offer a comprehensive view of the relationship between product attributes and consumer experiences, enabling us to uncover actionable insights for brands in the cosmetic industry.

It is worth noting that the customer reviews dataset covers reviews on only 2,000 skincare products, while the product information dataset includes details on 8,000+ products across various categories.

Data Description¶

In [171]:
# import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from wordcloud import WordCloud

Product Information¶

Firstly, we load the data from existing csv file and get an idea of its shape as well as fields included.

In [172]:
product_df = pd.read_csv("Sephora/product_info.csv")
In [173]:
print("product info dataframe shape: ", product_df.shape)
product_df.head()
product info dataframe shape:  (8494, 27)
Out[173]:
product_id product_name brand_id brand_name loves_count rating reviews size variation_type variation_value variation_desc ingredients price_usd value_price_usd sale_price_usd limited_edition new online_only out_of_stock sephora_exclusive highlights primary_category secondary_category tertiary_category child_count child_max_price child_min_price
0 P473671 Fragrance Discovery Set 6342 19-69 6320 3.6364 11.0 NaN NaN NaN NaN ['Capri Eau de Parfum:', 'Alcohol Denat. (SD A... 35.0 NaN NaN 0 0 1 0 0 ['Unisex/ Genderless Scent', 'Warm &Spicy Scen... Fragrance Value & Gift Sets Perfume Gift Sets 0 NaN NaN
1 P473668 La Habana Eau de Parfum 6342 19-69 3827 4.1538 13.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL NaN ['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra... 195.0 NaN NaN 0 0 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 85.0 30.0
2 P473662 Rainbow Bar Eau de Parfum 6342 19-69 3253 4.2500 16.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL NaN ['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra... 195.0 NaN NaN 0 0 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0
3 P473660 Kasbah Eau de Parfum 6342 19-69 3018 4.4762 21.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL NaN ['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra... 195.0 NaN NaN 0 0 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0
4 P473658 Purple Haze Eau de Parfum 6342 19-69 2691 3.2308 13.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL NaN ['Alcohol Denat. (SD Alcohol 39C), Parfum (Fra... 195.0 NaN NaN 0 0 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0

To get an idea of the general information of the dataset both intuitively and statistically, we use .info() and .describe() correspondingly to have basic information result, and the outcome is displayed as follows.

In [174]:
print("Basic Information:")
product_df.info()
Basic Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8494 entries, 0 to 8493
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   product_id          8494 non-null   object 
 1   product_name        8494 non-null   object 
 2   brand_id            8494 non-null   int64  
 3   brand_name          8494 non-null   object 
 4   loves_count         8494 non-null   int64  
 5   rating              8216 non-null   float64
 6   reviews             8216 non-null   float64
 7   size                6863 non-null   object 
 8   variation_type      7050 non-null   object 
 9   variation_value     6896 non-null   object 
 10  variation_desc      1250 non-null   object 
 11  ingredients         7549 non-null   object 
 12  price_usd           8494 non-null   float64
 13  value_price_usd     451 non-null    float64
 14  sale_price_usd      270 non-null    float64
 15  limited_edition     8494 non-null   int64  
 16  new                 8494 non-null   int64  
 17  online_only         8494 non-null   int64  
 18  out_of_stock        8494 non-null   int64  
 19  sephora_exclusive   8494 non-null   int64  
 20  highlights          6287 non-null   object 
 21  primary_category    8494 non-null   object 
 22  secondary_category  8486 non-null   object 
 23  tertiary_category   7504 non-null   object 
 24  child_count         8494 non-null   int64  
 25  child_max_price     2754 non-null   float64
 26  child_min_price     2754 non-null   float64
dtypes: float64(7), int64(8), object(12)
memory usage: 1.7+ MB
In [175]:
print("\nSummary Statistics (Numerical):")
product_df.describe()
Summary Statistics (Numerical):
Out[175]:
brand_id loves_count rating reviews price_usd value_price_usd sale_price_usd limited_edition new online_only out_of_stock sephora_exclusive child_count child_max_price child_min_price
count 8494.000000 8.494000e+03 8216.000000 8216.000000 8494.000000 451.000000 270.000000 8494.000000 8494.000000 8494.000000 8494.000000 8494.000000 8494.000000 2754.000000 2754.000000
mean 5422.440546 2.917957e+04 4.194513 448.545521 51.655595 91.168537 20.207889 0.070285 0.071698 0.219096 0.073699 0.279374 1.631622 53.792023 39.665802
std 1709.595957 6.609212e+04 0.516694 1101.982529 53.669234 79.195631 24.327352 0.255642 0.258002 0.413658 0.261296 0.448718 5.379470 58.765894 38.685720
min 1063.000000 0.000000e+00 1.000000 1.000000 3.000000 0.000000 1.750000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.000000 3.000000
25% 5333.000000 3.758000e+03 3.981725 26.000000 25.000000 45.000000 8.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 22.000000 19.000000
50% 6157.500000 9.880000e+03 4.289350 122.000000 35.000000 67.000000 14.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 32.000000 28.000000
75% 6328.000000 2.684125e+04 4.530525 418.000000 58.000000 108.500000 25.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000 59.000000 42.000000
max 8020.000000 1.401068e+06 5.000000 21281.000000 1900.000000 617.000000 320.000000 1.000000 1.000000 1.000000 1.000000 1.000000 105.000000 570.000000 400.000000

Customer Review¶

We load data of customer review into dataframes and concat them together into one complete dataframe altogether. after that, we apply similar analysis methods to have an intuitive overview. The results are displayed as follows.

In [176]:
# Reference: https://www.kaggle.com/code/themeeemul/sephora-eda-and-sentiment-analysis-using-pytorch
review_df_1 = pd.read_csv("Sephora/reviews_0-250.csv",index_col = 0, dtype={'author_id':'str'})
review_df_2 = pd.read_csv("Sephora/reviews_250-500.csv",index_col = 0, dtype={'author_id':'str'})
review_df_3 = pd.read_csv("Sephora/reviews_500-750.csv",index_col = 0, dtype={'author_id':'str'})
review_df_4 = pd.read_csv("Sephora/reviews_750-1250.csv",index_col = 0, dtype={'author_id':'str'})
review_df_5 = pd.read_csv("Sephora/reviews_1250-end.csv",index_col = 0, dtype={'author_id':'str'})

# Merge review_df_1 till review_df_6
review_df = pd.concat([review_df_1, review_df_2, review_df_3, review_df_4, review_df_5],axis=0)
print("review_df shape: ",review_df.shape)
review_df shape:  (1094411, 18)
In [177]:
review_df.sample(5)
Out[177]:
author_id rating is_recommended helpfulness total_feedback_count total_neg_feedback_count total_pos_feedback_count submission_time review_text review_title skin_tone eye_color skin_type hair_color product_id product_name brand_name price_usd
318520 2506218805 5 1.0 NaN 0 0 0 2020-01-22 This product really surprised me!! Being on Ac... So helpful with Accutane!! light blue combination blonde P442840 Barrier+ Triple Lipid-Peptide Face Cream Skinfix 54.0
47147 9153021874 5 1.0 1.0 2 0 2 2020-07-07 This is unlike any micellar water I’ve tried b... Texturally amazing fair hazel combination brown P462347 E-Rase Milki Micellar Water Mini Drunk Elephant 14.0
184012 1476710027 5 1.0 NaN 0 0 0 2020-08-10 I’ve been using this product for a couple week... So Far, So Good fair brown normal auburn P460779 Skinlongevity Long Life Herb Anti-Aging Face S... bareMinerals 65.0
167906 1419803262 3 1.0 1.0 2 0 2 2020-08-29 I tried this out for a whole week to see how t... Good night moisturizer medium brown oily brown P461213 Honey Balm Niacinamide Moisturizer Wishful 43.0
179021 2489687625 5 1.0 1.0 2 0 2 2022-07-01 Love this product for summer and lighter makeu... NaN tan brown combination brown P500245 C Beyond Triple Serum SPF 40 Mineral Sunscreen... ILIA 64.0
In [178]:
print("Basic Information:")
review_df.info()
Basic Information:
<class 'pandas.core.frame.DataFrame'>
Index: 1094411 entries, 0 to 49976
Data columns (total 18 columns):
 #   Column                    Non-Null Count    Dtype  
---  ------                    --------------    -----  
 0   author_id                 1094411 non-null  object 
 1   rating                    1094411 non-null  int64  
 2   is_recommended            926423 non-null   float64
 3   helpfulness               532819 non-null   float64
 4   total_feedback_count      1094411 non-null  int64  
 5   total_neg_feedback_count  1094411 non-null  int64  
 6   total_pos_feedback_count  1094411 non-null  int64  
 7   submission_time           1094411 non-null  object 
 8   review_text               1092967 non-null  object 
 9   review_title              783757 non-null   object 
 10  skin_tone                 923872 non-null   object 
 11  eye_color                 884783 non-null   object 
 12  skin_type                 982854 non-null   object 
 13  hair_color                867643 non-null   object 
 14  product_id                1094411 non-null  object 
 15  product_name              1094411 non-null  object 
 16  brand_name                1094411 non-null  object 
 17  price_usd                 1094411 non-null  float64
dtypes: float64(3), int64(4), object(11)
memory usage: 158.6+ MB
In [179]:
print("\nSummary Statistics (Numerical):")
review_df.describe()
Summary Statistics (Numerical):
Out[179]:
rating is_recommended helpfulness total_feedback_count total_neg_feedback_count total_pos_feedback_count price_usd
count 1.094411e+06 926423.000000 532819.000000 1.094411e+06 1.094411e+06 1.094411e+06 1.094411e+06
mean 4.299158e+00 0.839962 0.767782 4.177126e+00 8.948695e-01 3.282257e+00 4.900838e+01
std 1.149444e+00 0.366642 0.317164 2.271524e+01 5.288943e+00 1.967482e+01 4.004338e+01
min 1.000000e+00 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00
25% 4.000000e+00 1.000000 0.652174 0.000000e+00 0.000000e+00 0.000000e+00 2.500000e+01
50% 5.000000e+00 1.000000 0.928571 0.000000e+00 0.000000e+00 0.000000e+00 3.900000e+01
75% 5.000000e+00 1.000000 1.000000 3.000000e+00 1.000000e+00 3.000000e+00 6.200000e+01
max 5.000000e+00 1.000000 1.000000 5.464000e+03 1.159000e+03 5.050000e+03 1.900000e+03

Note: The missing value part will be introduced in section Data Cleaning.

Data Manipulation & Visualization¶

After the first step of importing, merging and describing data, we go a step further into data manipulation, to process on the data and turn it into a more tidy and easy-to-use format. Alongside, we conduct data visualization to provide us a straightforward display of all kinds of data distribution, so as to give us a clearer sense of the data. Aspects include customers' rating, item price, size, etc.

Rating¶

To see the overall reputation condition of products, we visualized the distribution of customers' rating on all products. Shown below we can see that large percentage of customers give a rating between 4.5-5.0/5.0, indicating that they're satisfied with the products.

In [180]:
# Customer Rating Distribution
plt.figure(figsize=(8, 5))
review_df['rating'].hist(bins=8, color='lightcoral', edgecolor='black')
plt.title('Customer Rating Distribution')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()
Price¶

To see the price distribution of all products, we create the visualization below. To avoid outliers, we will only visualize products with price <= 350 USD.

In [181]:
# Price Distribution
# To avoid outliers, we will only visualize products with price <= 350 USD
plt.figure(figsize=(8, 5))
product_df[product_df['price_usd'] <= 350]['price_usd'].hist(bins=30, color='gold', edgecolor='black')
plt.title('Price Distribution')
plt.xlabel('Price (USD)')
plt.ylabel('Frequency')
plt.show()
Size¶

To see the size of the products, we extract numerical values from certain columns and conduct some datatype conversion to get get the size in both oz and ml.

In [182]:
product_df['size']
Out[182]:
0                   NaN
1        3.4 oz/ 100 mL
2        3.4 oz/ 100 mL
3        3.4 oz/ 100 mL
4        3.4 oz/ 100 mL
             ...       
8489                NaN
8490       2 oz / 60 mL
8491                NaN
8492                NaN
8493    .11 oz / 3.2 mL
Name: size, Length: 8494, dtype: object
In [183]:
product_df['size_oz'] = product_df['size'].str.extract(r'(\d+\.?\d*)\s*oz', expand=False).astype(float)
print(product_df[['size', 'size_oz']].sample(10))
                size  size_oz
1294             NaN      NaN
4859    6 oz/ 177 mL     6.00
6457             NaN      NaN
414    1.7 oz/ 50 mL     1.70
5919             NaN      NaN
56      16 oz/ 453 g    16.00
4917  0.12 oz/ 3.4 g     0.12
7706             NaN      NaN
350    1.7 oz/ 50 mL     1.70
2747  6.5 oz/ 192 mL     6.50
In [184]:
product_df['size_ml'] = product_df['size'].str.extract(r'(\d+\.?\d*)\s*mL', expand=False)
product_df['size_ml'] = product_df['size_ml'].astype(float)
print(product_df[['size', 'size_ml']].sample(10))
                size  size_ml
5067   1.7 oz/ 50 mL     50.0
8248     2 oz/ 60 mL     60.0
6149             NaN      NaN
538              NaN      NaN
6929   1.7 oz/ 50 mL     50.0
6306  6.8 oz/ 200 mL    200.0
4927             NaN      NaN
2355   7.0 oz/ 198 g      NaN
7556  0.33 oz/ 11 mL     11.0
1114   1.6 oz/ 50 mL     50.0

Price Per Unit¶

To get an idea of the price distribution of products, we conduct some calculation and visualization to see the distribution. The results are show below.

In [185]:
product_df['price_per_oz'] = product_df['price_usd'] / product_df['size_oz']
product_df['price_per_ml'] = product_df['price_usd'] / product_df['size_ml']
product_df[['price_usd', 'size_oz', 'price_per_oz', 'size_ml', 'price_per_ml']].sample(10)
Out[185]:
price_usd size_oz price_per_oz size_ml price_per_ml
2114 67.0 1.70 39.411765 50.0 1.34
5667 22.0 1.70 12.941176 50.0 0.44
1264 38.0 NaN NaN NaN NaN
5186 71.0 1.70 41.764706 50.0 1.42
1368 29.0 0.34 85.294118 10.0 2.90
8427 20.0 NaN NaN NaN NaN
2079 150.0 1.69 88.757396 50.0 3.00
3419 27.0 0.42 64.285714 12.5 2.16
5513 29.0 NaN NaN NaN NaN
7427 22.0 NaN NaN NaN NaN
In [186]:
# Price Distribution
# To avoid outliers, we will only visualize products with price <= 350 USD
plt.figure(figsize=(8, 5))
product_df[product_df['price_per_oz'] <= 500]['price_per_oz'].hist(bins=30, color='gold', edgecolor='black')
plt.title('Price Distribution')
plt.xlabel('Price (USD)')
plt.ylabel('Frequency')
plt.show()

Category¶

To have an idea of how many items belong to each category, we conduct counting on both primary and secondary categories.

In [187]:
product_df['primary_category'].value_counts()
Out[187]:
primary_category
Skincare           2420
Makeup             2369
Hair               1464
Fragrance          1432
Bath & Body         405
Mini Size           288
Men                  60
Tools & Brushes      52
Gifts                 4
Name: count, dtype: int64
In [188]:
print(product_df['secondary_category'].shape)
product_df['secondary_category'].value_counts()
(8494,)
Out[188]:
secondary_category
Women                        875
Hair Styling & Treatments    757
Eye                          711
Face                         659
Moisturizers                 551
Value & Gift Sets            498
Treatments                   466
Shampoo & Conditioner        431
Lip                          411
Cleansers                    361
Candles & Home Scents        263
Brushes & Applicators        246
Body Moisturizers            220
Mini Size                    187
Eye Care                     186
Masks                        166
Cheek                        165
Tools                        153
Makeup                       137
Men                          135
Sunscreen                    109
Skincare                      98
Bath & Shower                 84
High Tech Tools               80
Wellness                      79
Body Care                     69
Self Tanners                  64
Lip Balms & Treatments        61
Hair                          59
Nail                          52
Accessories                   45
Beauty Tools                  23
Makeup Palettes               20
Fragrance                     15
Shaving                       15
Hair Tools                    11
Bath & Body                    7
Shop by Concern                5
Beauty Accessories             5
Other Needs                    5
Beauty Supplements             2
Name: count, dtype: int64

Loves Count¶

In [189]:
# Price Distribution
# To avoid outliers, we will only visualize products with price <= 350 USD
plt.figure(figsize=(8, 5))
product_df[product_df['loves_count'] < 400000]['loves_count'].hist(bins=30, color='gold', edgecolor='black')
plt.title('loves Count Distribution')
plt.xlabel('Count')
plt.ylabel('Frequency')
plt.show()

Positive/Negative Feedback Percentage¶

Also we calculate and visualize on the positive and negative feedback percentages of customers, as another indication of their satisfactory level.

In [190]:
review_df['pos_feedback_percentage'] = review_df['total_pos_feedback_count']/review_df['total_feedback_count']
review_df['neg_feedback_percentage'] = review_df['total_neg_feedback_count']/review_df['total_feedback_count']
review_df[['total_pos_feedback_count', 'total_feedback_count', 'pos_feedback_percentage', 'total_neg_feedback_count', 'neg_feedback_percentage']].sample(5)
Out[190]:
total_pos_feedback_count total_feedback_count pos_feedback_percentage total_neg_feedback_count neg_feedback_percentage
452991 0 0 NaN 0 NaN
11054 0 1 0.000000 1 1.000000
115964 1 3 0.333333 2 0.666667
89870 0 6 0.000000 6 1.000000
117871 0 0 NaN 0 NaN
In [191]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle('Feedback Percentage Distributions')

axes[0].hist(review_df['pos_feedback_percentage'], bins=30, color='green', edgecolor='black')
axes[0].set_title('Positive Feedback Percentage')
axes[0].set_xlabel('Percentage')
axes[0].set_ylabel('Number of Products')

axes[1].hist(review_df['neg_feedback_percentage'], bins=30, color='red', edgecolor='black')
axes[1].set_title('Negative Feedback Percentage')
axes[1].set_xlabel('Percentage')
axes[1].set_ylabel('Number of Products')

plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

Ingredient¶

To see the ingredients contained in the products, we processed the strings in the ingredients column to extract certain fields, thus making some inferences with the help of visualization tools.

In [192]:
product_df['ingredients'] = product_df['ingredients'].str.strip("[]").str.replace("'", "").str.split(", ")
all_ingredients = product_df['ingredients'].explode().dropna().tolist()
ingredient_counts = Counter(all_ingredients)
ingredient_df = pd.DataFrame(ingredient_counts.items(), columns=['Ingredient', 'Count']).sort_values(by='Count', ascending=False)
In [193]:
ingredient_df.head(5)
Out[193]:
Ingredient Count
191 Glycerin 4171
212 Phenoxyethanol 3862
211 Caprylyl Glycol 2747
321 Tocopherol 2681
20 Limonene 2537
In [194]:
top20_ingredients = [ingredient for ingredient, _ in ingredient_counts.most_common(20)]

# Add column in product_df for each of the top 20 ingredients for further analysis
for ingredient in top20_ingredients:
    column_name = f'contains_{ingredient}'
    product_df[column_name] = product_df['ingredients'].apply(lambda x: ingredient in x if isinstance(x, list) else False)
In [195]:
columns_to_display = ['product_id'] + [f'contains_{ingredient}' for ingredient in top20_ingredients]
product_df[columns_to_display].head()
Out[195]:
product_id contains_Glycerin contains_Phenoxyethanol contains_Caprylyl Glycol contains_Tocopherol contains_Limonene contains_Ethylhexylglycerin contains_Dimethicone contains_Linalool contains_Silica contains_Butylene Glycol contains_Citric Acid contains_Mica contains_Potassium Sorbate contains_Water contains_Caprylic/Capric Triglyceride contains_Sodium Benzoate contains_Citronellol contains_Tocopheryl Acetate contains_Propanediol contains_Geraniol
0 P473671 False False False False True False False True False False False False False False False False True False False True
1 P473668 False False False False False False False False False False False False False False False False False False False False
2 P473662 False False False False False False False True False False False False False False False False False False False False
3 P473660 False False False False False False False True False False False False False False False False True False False True
4 P473658 False False False False False False False True False False False False False False False False False False False True
In [196]:
plt.figure(figsize=(12, 6))
ingredient_df.head(20).plot(kind='barh', x='Ingredient', y='Count', legend=False, color='skyblue')
plt.title('Top 20 Most Common Ingredients in Products')
plt.xlabel('Count')
plt.ylabel('Ingredient')
plt.gca().invert_yaxis()
plt.show()
<Figure size 1200x600 with 0 Axes>
In [197]:
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(ingredient_counts)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Ingredients')
plt.show()

Merge Two Datasets¶

After all of the work and analysis above, we then merge the two datasets together for more convenient future work and analysis. The process is shown below.

In [198]:
# Reference: https://www.kaggle.com/code/themeeemul/sephora-eda-and-sentiment-analysis-using-pytorch
# Merge product_df and review_df
cols_to_use = product_df.columns.difference(review_df.columns)
cols_to_use = list(cols_to_use)
cols_to_use.append('product_id')

Sephora_df = pd.merge(review_df, product_df[cols_to_use], how='outer', on=['product_id', 'product_id'])
print("Sephora Shape: ", Sephora_df.shape)
Sephora Shape:  (1100554, 66)
In [199]:
Sephora_df.sample(5)
Out[199]:
author_id rating is_recommended helpfulness total_feedback_count total_neg_feedback_count total_pos_feedback_count submission_time review_text review_title skin_tone eye_color skin_type hair_color product_id product_name brand_name price_usd pos_feedback_percentage neg_feedback_percentage brand_id child_count child_max_price child_min_price contains_Butylene Glycol contains_Caprylic/Capric Triglyceride contains_Caprylyl Glycol contains_Citric Acid contains_Citronellol contains_Dimethicone contains_Ethylhexylglycerin contains_Geraniol contains_Glycerin contains_Limonene contains_Linalool contains_Mica contains_Phenoxyethanol contains_Potassium Sorbate contains_Propanediol contains_Silica contains_Sodium Benzoate contains_Tocopherol contains_Tocopheryl Acetate contains_Water highlights ingredients limited_edition loves_count new online_only out_of_stock price_per_ml price_per_oz primary_category reviews sale_price_usd secondary_category sephora_exclusive size size_ml size_oz tertiary_category value_price_usd variation_desc variation_type variation_value
407789 11958462128 5.0 1.0 1.0 7.0 0.0 7.0 2020-01-13 I LOVE this product! After seeing reviews for ... Little miracle bottle is what it is.... light hazel combination red P427416 Caffeine 5% + EGCG Depuffing Eye Serum The Ordinary 8.90 1.0 0.0 6234 0 NaN NaN False False False False False False True False True False False False True False True False False False False False ['Vegan', 'Community Favorite', 'Good for: Dar... [Aqua (Water), Caffeine, Maltodextrin, Glyceri... 0 281928 0 0 0 0.296667 8.900000 Skincare 2118.0 NaN Treatments 0 1 oz/ 30 mL 30.0 1.0 Face Serums NaN NaN Size 1 oz/ 30 mL
932672 2077003084 5.0 1.0 NaN 0.0 0.0 0.0 2022-07-02 The best! It is cheap and works amazing. I lit... NaN lightMedium NaN dry NaN P478030 Mini Oat Cleansing Balm The INKEY List 5.99 NaN NaN 6285 0 NaN NaN False False False False False False False False False False False False True False False True False True False False ['allure 2020 Best of Beauty Award Winner', 'B... [Prunus Amygdalus Dulcis (Sweet Almond) Oil, C... 0 40474 0 0 0 0.119800 3.523529 Skincare 2968.0 NaN Cleansers 1 1.7 oz/ 50 mL 50.0 1.7 Face Wash & Cleansers NaN NaN Size 1.7 oz/ 50 mL
69763 1325856411 5.0 1.0 NaN 0.0 0.0 0.0 2019-11-12 This is the best moisturizer I’ve ever used!!!... AMAZING!!!!! lightMedium brown combination black P248407 Ultra Repair Cream Intense Hydration First Aid Beauty 38.00 NaN NaN 5972 3 48.0 18.0 False True True False False True False False True False False False True False False False False False False True ['Best for Dry Skin', 'Community Favorite', 'C... [Colloidal Oatmeal 0.50%, Water, Stearic Acid,... 0 300432 0 0 0 NaN 6.333333 Skincare 7539.0 NaN Moisturizers 0 6 oz/ 170 g NaN 6.0 Moisturizers NaN NaN Size 6 oz/ 170 g
893051 31136105305 5.0 1.0 NaN 0.0 0.0 0.0 2021-07-08 I received a sample of this and I am very impr... NaN fair NaN normal NaN P473820 Advanced Night Repair Eye Concentrate Matrix S... Estée Lauder 82.00 NaN NaN 6089 0 NaN NaN True False False False False True True False True False False False True True False False False False True False ['Good for: Loss of firmness', 'Good for: Dark... [Water/Aqua/Eau, Dimethicone, Bifida Ferment L... 0 5480 0 1 0 5.466667 164.000000 Skincare 1096.0 NaN Treatments 0 0.5 oz/ 15 mL 15.0 0.5 Face Serums NaN NaN NaN NaN
306974 8487209118 5.0 1.0 NaN 0.0 0.0 0.0 2018-05-08 I love all of the products from Sunday Riley. ... Sunday Riley is the best in the business. fair green normal brown P418629 C.E.O. Vitamin C Brightening Rich Hydration Mo... Sunday Riley 22.00 NaN NaN 6073 1 65.0 65.0 True True True True False False False False True True True False True False False False False True False True ['Good for: Dullness/Uneven Texture', 'Vitamin... [Water, Squalane, Dicaprylyl Carbonate, Tetrah... 0 5905 0 0 0 NaN 44.000000 Skincare 1101.0 NaN Moisturizers 0 0.5 oz/ 15 g NaN 0.5 Moisturizers NaN NaN Size 0.5 oz/ 15 g
In [200]:
Sephora_df.describe()
Out[200]:
rating is_recommended helpfulness total_feedback_count total_neg_feedback_count total_pos_feedback_count price_usd pos_feedback_percentage neg_feedback_percentage brand_id child_count child_max_price child_min_price limited_edition loves_count new online_only out_of_stock price_per_ml price_per_oz reviews sale_price_usd sephora_exclusive size_ml size_oz value_price_usd
count 1.094411e+06 926423.000000 532819.000000 1.094411e+06 1.094411e+06 1.094411e+06 1.094411e+06 532819.000000 532819.000000 1.100554e+06 1.100554e+06 455758.000000 455758.000000 1.100554e+06 1.100554e+06 1.100554e+06 1.100554e+06 1.100554e+06 854407.000000 992297.000000 1.100276e+06 9978.000000 1.100554e+06 854407.000000 992297.000000 31189.000000
mean 4.299158e+00 0.839962 0.767782 4.177126e+00 8.948695e-01 3.282257e+00 4.900838e+01 0.767782 0.232218 5.547432e+03 7.803261e-01 58.249297 30.317296 1.715227e-02 1.064307e+05 2.680832e-02 1.106088e-01 3.182942e-02 1.345774 40.577942 2.151789e+03 16.888082 3.239396e-01 67.240317 3.073330 114.830038
std 1.149444e+00 0.366642 0.317164 2.271524e+01 5.288943e+00 1.967482e+01 4.004338e+01 0.317164 0.317164 1.493821e+03 1.344017e+00 62.039099 28.813499 1.298387e-01 1.674201e+05 1.615229e-01 3.136473e-01 1.755458e-01 1.571010 141.789132 2.524573e+03 13.961803 4.679775e-01 55.643145 7.784536 62.313042
min 1.000000e+00 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 3.000000e+00 0.000000 0.000000 1.063000e+03 0.000000e+00 3.000000 3.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.020833 0.130682 1.000000e+00 1.750000 0.000000e+00 0.160000 0.000200 0.000000
25% 4.000000e+00 1.000000 0.652174 0.000000e+00 0.000000e+00 0.000000e+00 2.500000e+01 0.652174 0.000000 5.626000e+03 0.000000e+00 21.000000 17.000000 0.000000e+00 1.630000e+04 0.000000e+00 0.000000e+00 0.000000e+00 0.333333 9.990000 5.390000e+02 7.000000 0.000000e+00 30.000000 1.000000 68.000000
50% 5.000000e+00 1.000000 0.928571 0.000000e+00 0.000000e+00 0.000000e+00 3.900000e+01 0.928571 0.071429 6.073000e+03 0.000000e+00 40.000000 22.000000 0.000000e+00 4.814300e+04 0.000000e+00 0.000000e+00 0.000000e+00 0.950000 26.470588 1.337000e+03 18.000000 0.000000e+00 50.000000 1.700000 102.000000
75% 5.000000e+00 1.000000 1.000000 3.000000e+00 1.000000e+00 3.000000e+00 6.200000e+01 1.000000 0.347826 6.234000e+03 1.000000e+00 71.000000 30.000000 0.000000e+00 1.234390e+05 0.000000e+00 0.000000e+00 0.000000e+00 1.866667 50.000000 2.891000e+03 20.000000 1.000000e+00 95.000000 3.380000 142.000000
max 5.000000e+00 1.000000 1.000000 5.464000e+03 1.159000e+03 5.050000e+03 1.900000e+03 1.000000 1.000000 8.020000e+03 1.050000e+02 570.000000 400.000000 1.000000e+00 1.401068e+06 1.000000e+00 1.000000e+00 1.000000e+00 131.250000 110000.000000 2.128100e+04 320.000000 1.000000e+00 1000.000000 176.000000 617.000000

Data Cleaning¶

Finally, to deal with potential NaN values in the data of all kinds, we conduct data cleaning approaches shown as follows.

In [201]:
# 1. Check for missing values
print("Missing values per column:")
print(Sephora_df.isnull().sum())
Missing values per column:
author_id                  6143
rating                     6143
is_recommended           174131
helpfulness              567735
total_feedback_count       6143
                         ...   
tertiary_category        161894
value_price_usd         1069365
variation_desc          1091034
variation_type            52560
variation_value           64644
Length: 66, dtype: int64
In [202]:
# 2. Deal with missing values

missing_counts = Sephora_df.isna().sum()
threshold = 1
columns_to_drop = missing_counts[missing_counts > threshold].index
for col in columns_to_drop:
    print(f"{col}: {missing_counts[col]} missing values")
author_id: 6143 missing values
rating: 6143 missing values
is_recommended: 174131 missing values
helpfulness: 567735 missing values
total_feedback_count: 6143 missing values
total_neg_feedback_count: 6143 missing values
total_pos_feedback_count: 6143 missing values
submission_time: 6143 missing values
review_text: 7587 missing values
review_title: 316797 missing values
skin_tone: 176682 missing values
eye_color: 215771 missing values
skin_type: 117700 missing values
hair_color: 232911 missing values
product_name: 6143 missing values
brand_name: 6143 missing values
price_usd: 6143 missing values
pos_feedback_percentage: 567735 missing values
neg_feedback_percentage: 567735 missing values
child_max_price: 644796 missing values
child_min_price: 644796 missing values
highlights: 115729 missing values
ingredients: 22843 missing values
price_per_ml: 246147 missing values
price_per_oz: 108257 missing values
reviews: 278 missing values
sale_price_usd: 1090576 missing values
secondary_category: 8 missing values
size: 44661 missing values
size_ml: 246147 missing values
size_oz: 108257 missing values
tertiary_category: 161894 missing values
value_price_usd: 1069365 missing values
variation_desc: 1091034 missing values
variation_type: 52560 missing values
variation_value: 64644 missing values

Considering product name the unique identifier without which analysis becomes meaningless, we choose to drop the rows where product_name is missing in the first place.

In [203]:
# Drop rows with no product information (NaN in column 'product_name')
Sephora_df = Sephora_df.dropna(subset=['product_name'])

After that, we fill up the NaN values following certain rules correspondingly. The details are displayed below in the code block.

In [204]:
# 1. Handle feedback-related columns
Sephora_df = Sephora_df.assign(
    helpfulness=Sephora_df['helpfulness'].fillna(0),
    pos_feedback_percentage=Sephora_df['pos_feedback_percentage'].fillna(1),
    neg_feedback_percentage=Sephora_df['neg_feedback_percentage'].fillna(0),
    is_recommended=Sephora_df['is_recommended'].fillna(1)
)

# 2. Handle price-related columns by filling with price_usd
Sephora_df = Sephora_df.assign(
    child_max_price=Sephora_df['child_max_price'].fillna(Sephora_df['price_usd']),
    child_min_price=Sephora_df['child_min_price'].fillna(Sephora_df['price_usd']),
    sale_price_usd=Sephora_df['sale_price_usd'].fillna(Sephora_df['price_usd']),
    value_price_usd=Sephora_df['value_price_usd'].fillna(Sephora_df['price_usd'])
)

# 3. Handle variation-related columns
Sephora_df = Sephora_df.assign(
    variation_desc=Sephora_df['variation_desc'].fillna('No variation'),
    variation_type=Sephora_df['variation_type'].fillna('Unknown'),
    variation_value=Sephora_df['variation_value'].fillna('Unknown')
)

# 4. Handle product and review metadata
Sephora_df = Sephora_df.assign(
    review_text=Sephora_df['review_text'].fillna('No review provided'),
    review_title=Sephora_df['review_title'].fillna('No title'),
    tertiary_category=Sephora_df['tertiary_category'].fillna('Uncategorized')
)

# 5. Handle skin, hair, and eye attributes
Sephora_df = Sephora_df.assign(
    skin_tone=Sephora_df['skin_tone'].fillna('Not specified'),
    eye_color=Sephora_df['eye_color'].fillna('Not specified'),
    skin_type=Sephora_df['skin_type'].fillna('Not specified'),
    hair_color=Sephora_df['hair_color'].fillna('Not specified')
)

# 6. Handle ingredient and highlight columns
Sephora_df = Sephora_df.assign(
    ingredients=Sephora_df['ingredients'].fillna('Ingredients not listed'),
    highlights=Sephora_df['highlights'].fillna('No highlights available')
)

# 7. Handle size-related columns
Sephora_df = Sephora_df.assign(
    size=Sephora_df['size'].fillna('Unknown size'),
    size_ml=Sephora_df['size_ml'].fillna(Sephora_df['size_ml'].median()),
    size_oz=Sephora_df['size_oz'].fillna(Sephora_df['size_oz'].median())
)

# 8. Handle price per unit columns
Sephora_df = Sephora_df.assign(
    price_per_ml=Sephora_df['price_per_ml'].fillna(Sephora_df['price_per_ml'].median()),
    price_per_oz=Sephora_df['price_per_oz'].fillna(Sephora_df['price_per_oz'].median())
)

# 9. Print the result to confirm the missing values have been handled
missing_counts_after = Sephora_df.isna().sum()
print("Missing values after processing:")
print(missing_counts_after[missing_counts_after > 0])
Missing values after processing:
Series([], dtype: int64)
In [205]:
numerical_cols = Sephora_df.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = Sephora_df.select_dtypes(include=['object']).columns
In [206]:
# 3. Strip whitespace from string columns (if any)
Sephora_df[categorical_cols] = Sephora_df[categorical_cols].apply(lambda x: x.str.strip())

Part 2¶

In Project Part 2, we will focus on applying statistical and graphical analyses to extract insights and address the research questions we posed.

For review, this is our research questions:

  1. What factors drive higher product ratings? Why do certain products underperform?
  2. How does customer sentiment change over time?
  3. Are certain ingredients linked to positive or negative customer experiences?
  4. How certain ingredients link to effectiveness of customers with different skin types?

Research Problem 1: What factors drive higher product ratings? Why do certain products underperform?¶

For research problem 1, our primary interest lies in understanding how product-related factors influence average ratings of the product. Therefore we will mainly focus on product_df. The combined dataset will be used in research problem 2 & research problem 3 & research problem 4.

Supplement Project Part 1: Basic Analysis¶

This part we will supplement some basic analysis of

In [207]:
product_df.columns
Out[207]:
Index(['product_id', 'product_name', 'brand_id', 'brand_name', 'loves_count',
       'rating', 'reviews', 'size', 'variation_type', 'variation_value',
       'variation_desc', 'ingredients', 'price_usd', 'value_price_usd',
       'sale_price_usd', 'limited_edition', 'new', 'online_only',
       'out_of_stock', 'sephora_exclusive', 'highlights', 'primary_category',
       'secondary_category', 'tertiary_category', 'child_count',
       'child_max_price', 'child_min_price', 'size_oz', 'size_ml',
       'price_per_oz', 'price_per_ml', 'contains_Glycerin',
       'contains_Phenoxyethanol', 'contains_Caprylyl Glycol',
       'contains_Tocopherol', 'contains_Limonene',
       'contains_Ethylhexylglycerin', 'contains_Dimethicone',
       'contains_Linalool', 'contains_Silica', 'contains_Butylene Glycol',
       'contains_Citric Acid', 'contains_Mica', 'contains_Potassium Sorbate',
       'contains_Water', 'contains_Caprylic/Capric Triglyceride',
       'contains_Sodium Benzoate', 'contains_Citronellol',
       'contains_Tocopheryl Acetate', 'contains_Propanediol',
       'contains_Geraniol'],
      dtype='object')

To analyze the relationship between product average ratings and related numerical values (such as Price_usd, Price_per_oz, Size_oz, Loves_count, and Reviews), we can calculate the correlation between rating and these numerical features in the product_df DataFrame. This will help us identify which factors are most strongly associated with ratings.

In [208]:
correlation_features = ['rating', 'price_usd', 'price_per_oz', 'size_oz', 'loves_count', 'reviews']

correlation_data = product_df[correlation_features]
In [209]:
corr_matrix = correlation_data.corr()

plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()
In [210]:
# Relationship between rating and top 10 ingredients
correlation_features = ['rating', 'contains_Glycerin',
       'contains_Phenoxyethanol', 'contains_Caprylyl Glycol',
       'contains_Tocopherol', 'contains_Limonene',
       'contains_Ethylhexylglycerin', 'contains_Dimethicone',
       'contains_Linalool', 'contains_Silica', 'contains_Butylene Glycol']
correlation_data = product_df[correlation_features]
In [211]:
corr_matrix = correlation_data.corr()

plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()

From the results above, we can observe that there is no significant relationship between loves_count, reviews, ingredients, and rating.

However, we will analyze the numerical variables in more detail in the next several sections, and also analyze the categorical variables using the Comparative & Multivariate Analysis.

Descriptive Statistics¶

  • Objective: Use descriptive statistics to explore the central tendency and dispersion of rating data to identify primary drivers of product ratings.
  • Method: Calculate the mean, median, mode, range, variance, and standard deviation of product ratings and other factors.
  • Expected Outcome: By analyzing the mean and median, we can assess rating distribution. Categories with higher standard deviations might indicate a broader range of factors affecting performance.
In [212]:
# Use ChatGPT to generate a summary of the dataset

# Descriptive statistics for the 'rating' column and other relevant columns in product_df
rating_data = product_df['rating']  # Replace 'rating' with actual column name if different

# Calculating descriptive statistics
mean_rating = rating_data.mean()
median_rating = rating_data.median()
mode_rating = rating_data.mode().iloc[0]  # Mode may return multiple values; take the first mode
range_rating = rating_data.max() - rating_data.min()
variance_rating = rating_data.var()
std_dev_rating = rating_data.std()

# Print results
print("Descriptive Statistics for Ratings:")
print(f"Mean: {mean_rating}")
print(f"Median: {median_rating}")
print(f"Mode: {mode_rating}")
print(f"Range: {range_rating}")
print(f"Variance: {variance_rating}")
print(f"Standard Deviation: {std_dev_rating}")
Descriptive Statistics for Ratings:
Mean: 4.194512889483933
Median: 4.289350000000001
Mode: 5.0
Range: 4.0
Variance: 0.26697307784114505
Standard Deviation: 0.5166943756623881

Conclusions:

  1. Mean and Median:

    • The mean (4.19) and median (4.29) reflect generally positive product ratings, with a skew toward higher values, indicating strong overall customer satisfaction.
  2. Mode:

    • A mode of 5.0 suggests that a significant number of products received top ratings.
  3. Range:

    • Average product ratings span from 1.0 to 5.0, indicating differences in customer satisfaction.
  4. Variance and Standard Deviation:

    • The moderate variance (0.27) and standard deviation (0.52) indicate some variation in ratings, showing that while most ratings are clustered, there is notable diversity.
In [213]:
# Code generated by ChatGPT

# If you want to calculate these metrics for additional columns (e.g., 'price', 'review_count', etc.), you can loop through them:
relevant_columns = ['price_usd', 'price_per_oz', 'size_oz', 'loves_count', 'reviews']  # Replace with actual column names

# Loop through relevant columns and print statistics
for column in relevant_columns:
    print(f"\nDescriptive Statistics for {column.capitalize()}:")
    mean = product_df[column].mean()
    median = product_df[column].median()
    mode = product_df[column].mode().iloc[0]  # Take the first mode if multiple
    value_range = product_df[column].max() - product_df[column].min()
    variance = product_df[column].var()
    std_dev = product_df[column].std()
    
    print(f"Mean: {mean}")
    print(f"Median: {median}")
    print(f"Mode: {mode}")
    print(f"Range: {value_range}")
    print(f"Variance: {variance}")
    print(f"Standard Deviation: {std_dev}")
Descriptive Statistics for Price_usd:
Mean: 51.65559453732046
Median: 35.0
Mode: 30.0
Range: 1897.0
Variance: 2880.386721882422
Standard Deviation: 53.669234407455654

Descriptive Statistics for Price_per_oz:
Mean: 160.71142475513528
Median: 32.35294117647059
Mode: 10.0
Range: 109999.86931818182
Variance: 2595230.6017293455
Standard Deviation: 1610.971943185028

Descriptive Statistics for Size_oz:
Mean: 3.6344106906629747
Median: 1.7
Mode: 1.7
Range: 175.9998
Variance: 45.44301872925057
Standard Deviation: 6.741143725604029

Descriptive Statistics for Loves_count:
Mean: 29179.56592889098
Median: 9880.0
Mode: 0
Range: 1401068
Variance: 4368168668.437603
Standard Deviation: 66092.12258989419

Descriptive Statistics for Reviews:
Mean: 448.54552093476144
Median: 122.0
Mode: 2.0
Range: 21280.0
Variance: 1214365.4939714111
Standard Deviation: 1101.9825288866475

Conclusions and Why they are considered as related factors:

  1. Price_usd:

    • Positive skew with a mean of $51.66 and a high variance suggests a range of product pricing. It will potentially impact ratings if high-priced items don't meet expectations.
  2. Price_per_oz:

    • The high mean of $160.71 with a large standard deviation implies a few very expensive products that may set variable customer expectations and influence ratings.
  3. Size_oz:

    • Mean of 3.63 oz and lower variance suggest that most products are small, which could indirectly impact ratings if larger sizes are perceived as better value.
  4. Loves_count:

    • High variance and mean of 29,179 suggest popularity varies widely. Higher "Loves" counts may correlate with higher ratings, reflecting customer satisfaction.
  5. Reviews:

    • Positive skew with a mean of 448.55 and high variance indicates some products have more customer engagement, which could stabilize average ratings.

Inferential Statistics¶

  • Objective: Validate which variables (such as price, product type, or ingredients) significantly influence product ratings.
  • Method:
    • Regression Analysis: Evaluate relationships between product features (e.g., price, number of ingredients) and ratings.
    • ANOVA (Analysis of Variance): Test for significant differences in ratings across different product categories.
  • Expected Outcome: Determine whether certain product features have a significant effect on ratings, explaining why some products underperform.
Regression Analysis¶

This is for testing correlation with continuous variables like Price_usd

Steps:

  1. Define Hypotheses:

    • Null Hypothesis (H₀): There is no significant relationship between product ratings and price (Price_usd).
    • Alternative Hypothesis (H₁): There is a significant relationship between product ratings and price.
  2. Conduct Regression:

    • Perform a simple linear regression with rating as the dependent variable and Price_usd as the independent variable.
    • If you have multiple factors (like Price_usd, Price_per_oz, etc.), use multiple linear regression.
  3. Choose Significance Level:

    • Typically, we use a significance level of α = 0.05.
  4. Interpret Results:

    • If the p-value for Price_usd (or other factors) is below 0.05, reject the null hypothesis, suggesting a significant relationship between rating and price.
    • Check the coefficient to see if the relationship is positive or negative.
In [214]:
import statsmodels.api as sm
# GPT facilitated the generation of the following code

# Define the independent and dependent variables
X = product_df[['price_usd', 'price_per_oz', 'size_oz', 'loves_count', 'reviews']]
X = sm.add_constant(X)  # Adds a constant term to the model
y = product_df['rating']

# Drop rows with NaN or Inf values
X = X.replace([float('inf'), float('-inf')], float('nan')).dropna()
y = y.loc[X.index]  # Ensure y matches the filtered X index

# Fit the regression model
model = sm.OLS(y, X).fit()

# Print the summary of regression results
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 rating   R-squared:                       0.016
Model:                            OLS   Adj. R-squared:                  0.015
Method:                 Least Squares   F-statistic:                     20.15
Date:                Wed, 06 Nov 2024   Prob (F-statistic):           5.28e-20
Time:                        12:39:20   Log-Likelihood:                -4163.9
No. Observations:                6341   AIC:                             8340.
Df Residuals:                    6335   BIC:                             8380.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            4.1332      0.010    408.638      0.000       4.113       4.153
price_usd        0.0010      0.000      7.844      0.000       0.001       0.001
price_per_oz -5.617e-06    3.6e-06     -1.560      0.119   -1.27e-05    1.44e-06
size_oz          0.0024      0.001      2.658      0.008       0.001       0.004
loves_count  -1.248e-08   1.11e-07     -0.112      0.911   -2.31e-07    2.06e-07
reviews        3.13e-05   6.74e-06      4.644      0.000    1.81e-05    4.45e-05
==============================================================================
Omnibus:                     1746.116   Durbin-Watson:                   1.609
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             6024.500
Skew:                          -1.366   Prob(JB):                         0.00
Kurtosis:                       6.917   Cond. No.                     1.36e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.36e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Regression Results Analysis:

  1. Model Fit:

    • R-squared of 0.016 shows that only 1.6% of the rating variation is explained, indicating low predictive power for these features.
  2. Model Significance:

    • F-statistic p-value is significant (<0.05), but low R-squared suggests limited explanatory power.
  3. Coefficients:

    • Significant predictors:
      • price_usd (p = 0.000): Small positive effect on ratings.
      • size_oz (p = 0.008): Slight positive association with ratings.
      • reviews (p = 0.000): Higher reviews linked to slightly higher ratings.
    • Non-significant predictors:
      • price_per_oz and loves_count have no significant effect.
  4. Multicollinearity:

    • High condition number suggests potential multicollinearity, indicating interdependence among features.

Summary: While price_usd, size_oz, and reviews show a weak positive relationship with ratings, the low R-squared indicates these factors alone don’t strongly predict ratings. Further exploration with additional features is recommended.

ANOVA¶

This is for testing differences between categories, like product type, etc.

Steps:

  1. Define Hypotheses:

    • Null Hypothesis (H₀): There is no significant difference in ratings across different product types.
    • Alternative Hypothesis (H₁): There is a significant difference in ratings across different product types.
  2. Conduct ANOVA:

    • Use ANOVA if product_type (or any other categorical variable) is relevant. ANOVA helps test if the means of rating vary significantly across product types.
  3. Choose Significance Level:

    • Use a standard α = 0.05.
  4. Interpret Results:

    • If the ANOVA p-value is less than 0.05, reject the null hypothesis, indicating a significant difference in ratings across product types.
In [215]:
pd.set_option('display.max_columns', None)
product_df.head()
Out[215]:
product_id product_name brand_id brand_name loves_count rating reviews size variation_type variation_value variation_desc ingredients price_usd value_price_usd sale_price_usd limited_edition new online_only out_of_stock sephora_exclusive highlights primary_category secondary_category tertiary_category child_count child_max_price child_min_price size_oz size_ml price_per_oz price_per_ml contains_Glycerin contains_Phenoxyethanol contains_Caprylyl Glycol contains_Tocopherol contains_Limonene contains_Ethylhexylglycerin contains_Dimethicone contains_Linalool contains_Silica contains_Butylene Glycol contains_Citric Acid contains_Mica contains_Potassium Sorbate contains_Water contains_Caprylic/Capric Triglyceride contains_Sodium Benzoate contains_Citronellol contains_Tocopheryl Acetate contains_Propanediol contains_Geraniol
0 P473671 Fragrance Discovery Set 6342 19-69 6320 3.6364 11.0 NaN NaN NaN NaN [Capri Eau de Parfum:, Alcohol Denat. (SD Alco... 35.0 NaN NaN 0 0 1 0 0 ['Unisex/ Genderless Scent', 'Warm &Spicy Scen... Fragrance Value & Gift Sets Perfume Gift Sets 0 NaN NaN NaN NaN NaN NaN False False False False True False False True False False False False False False False False True False False True
1 P473668 La Habana Eau de Parfum 6342 19-69 3827 4.1538 13.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL NaN [Alcohol Denat. (SD Alcohol 39C), Parfum (Frag... 195.0 NaN NaN 0 0 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 85.0 30.0 3.4 100.0 57.352941 1.95 False False False False False False False False False False False False False False False False False False False False
2 P473662 Rainbow Bar Eau de Parfum 6342 19-69 3253 4.2500 16.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL NaN [Alcohol Denat. (SD Alcohol 39C), Parfum (Frag... 195.0 NaN NaN 0 0 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0 3.4 100.0 57.352941 1.95 False False False False False False False True False False False False False False False False False False False False
3 P473660 Kasbah Eau de Parfum 6342 19-69 3018 4.4762 21.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL NaN [Alcohol Denat. (SD Alcohol 39C), Parfum (Frag... 195.0 NaN NaN 0 0 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0 3.4 100.0 57.352941 1.95 False False False False False False False True False False False False False False False False True False False True
4 P473658 Purple Haze Eau de Parfum 6342 19-69 2691 3.2308 13.0 3.4 oz/ 100 mL Size + Concentration + Formulation 3.4 oz/ 100 mL NaN [Alcohol Denat. (SD Alcohol 39C), Parfum (Frag... 195.0 NaN NaN 0 0 1 0 0 ['Unisex/ Genderless Scent', 'Layerable Scent'... Fragrance Women Perfume 2 75.0 30.0 3.4 100.0 57.352941 1.95 False False False False False False False True False False False False False False False False False False False True
In [216]:
from scipy import stats
# GPT facilitated the generation of the following code

# Check if 'new' and 'rating' columns exist and handle missing values in 'rating'
if 'new' in product_df.columns and 'rating' in product_df.columns:
    # Remove NaN values from 'rating' for each group
    group_1 = product_df[product_df['new'] == 1]['rating'].dropna()
    group_0 = product_df[product_df['new'] == 0]['rating'].dropna()
    
    # Ensure both groups have data before running ANOVA
    if len(group_1) > 0 and len(group_0) > 0:
        # Conduct ANOVA
        anova_result = stats.f_oneway(group_1, group_0)
        print("ANOVA Result:", anova_result)
    else:
        print("One of the groups has no data.")
else:
    print("Error: 'new' or 'rating' column not found in product_df.")
ANOVA Result: F_onewayResult(statistic=86.50109826723633, pvalue=1.7601462104023535e-20)

Conclusion:

There is a significant difference (pvalue=1.760e-20) in average product ratings between new and non-new products.

In [217]:
# GPT facilitated the generation of the following code

# Check if 'online_only' and 'rating' columns exist and handle missing values in 'rating'
if 'new' in product_df.columns and 'rating' in product_df.columns:
    # Remove NaN values from 'rating' for each group
    group_1 = product_df[product_df['online_only'] == 1]['rating'].dropna()
    group_0 = product_df[product_df['online_only'] == 0]['rating'].dropna()
    
    # Ensure both groups have data before running ANOVA
    if len(group_1) > 0 and len(group_0) > 0:
        # Conduct ANOVA
        anova_result = stats.f_oneway(group_1, group_0)
        print("ANOVA Result:", anova_result)
    else:
        print("One of the groups has no data.")
else:
    print("Error: 'online_only' or 'rating' column not found in product_df.")
ANOVA Result: F_onewayResult(statistic=3.9889151861260834, pvalue=0.045833473720411584)

Conclusion:

There is a significant difference (pvalue=0.0458) in average product ratings between online_only and non-online_only products.

In [218]:
# GPT facilitated the generation of the following code

# Check if 'limited_edition' and 'rating' columns exist and handle missing values in 'rating'
if 'new' in product_df.columns and 'rating' in product_df.columns:
    # Remove NaN values from 'rating' for each group
    group_1 = product_df[product_df['limited_edition'] == 1]['rating'].dropna()
    group_0 = product_df[product_df['limited_edition'] == 0]['rating'].dropna()
    
    # Ensure both groups have data before running ANOVA
    if len(group_1) > 0 and len(group_0) > 0:
        # Conduct ANOVA
        anova_result = stats.f_oneway(group_1, group_0)
        print("ANOVA Result:", anova_result)
    else:
        print("One of the groups has no data.")
else:
    print("Error: 'limited_edition' or 'rating' column not found in product_df.")
ANOVA Result: F_onewayResult(statistic=21.401836958075258, pvalue=3.781165073111685e-06)

Conclusion:

There is a significant difference (pvalue=3.7811e-06) in average product ratings between limited_edition and non-limited_edition products.

In [219]:
# Code generated by ChatGPT

# Automatically find all columns that start with "contains_"
columns = [col for col in product_df.columns if col.startswith("contains_")]

# Dictionary to store ANOVA results
anova_results = {}

# Loop through each ingredient column and perform ANOVA
for col in columns:
    # Check if 'rating' exists in product_df
    if 'rating' in product_df.columns:
        # Remove NaN values from 'rating' for each group
        group_1 = product_df[product_df[col] == 1]['rating'].dropna()
        group_0 = product_df[product_df[col] == 0]['rating'].dropna()

        # Ensure both groups have data before running ANOVA
        if len(group_1) > 0 and len(group_0) > 0:
            # Conduct ANOVA
            anova_result = stats.f_oneway(group_1, group_0)
            anova_results[col] = anova_result.pvalue  # Store p-value
        else:
            print(f"Insufficient data for {col} (one of the groups has no data).")
    else:
        print("Error: 'rating' column not found in product_df.")
        break  # Exit the loop if 'rating' column is missing

# Display only significant results
significant_results = {col: p for col, p in anova_results.items() if p < 0.05}
print("\nSignificant ANOVA Results (p < 0.05):")
for col, p in significant_results.items():
    print(f"{col}: p-value = {p}")
Significant ANOVA Results (p < 0.05):
contains_Phenoxyethanol: p-value = 1.3530427418941688e-06
contains_Tocopherol: p-value = 0.005581632113425609
contains_Limonene: p-value = 9.816066072505535e-13
contains_Linalool: p-value = 4.1893522770234076e-09
contains_Silica: p-value = 0.0003927895758699168
contains_Butylene Glycol: p-value = 0.004688569041837416
contains_Water: p-value = 7.234178076608262e-05
contains_Caprylic/Capric Triglyceride: p-value = 0.005249331521078461
contains_Citronellol: p-value = 1.0400385863489007e-08
contains_Tocopheryl Acetate: p-value = 0.003148642697125757
contains_Geraniol: p-value = 1.3540721919634177e-08

Conclusion:

There is a significant difference in average product ratings between products with and without the following ingredients:

  • Phenoxyethanol
  • Tocopherol
  • Limonene
  • Linalool
  • Silica
  • Butylene Glycol
  • Water
  • Caprylic/Capric Triglyceride
  • Citronellol
  • Tocopheryl Acetate
  • Geraniol

Graphical Analysis¶

  • Objective: Visually represent rating distributions and correlations to uncover potential influencing factors.
  • Method: Use violin plots and other visualizations to observe rating distributions across different product categories, and scatter plots and other plots to explore relationships between ratings and factors such as price.
  • Expected Outcome: Identify which features are associated with higher or lower ratings and understand their impact.
Rating vs Boolean Features¶

Boolean features: 'limited_edition', 'new', 'online_only', 'contains_Phenoxyethanol', 'contains_Tocopherol', 'contains_Limonene', 'contains_Linalool', 'contains_Silica', 'contains_Butylene Glycol', 'contains_Water', 'contains_Caprylic/Capric Triglyceride', 'contains_Citronellol', 'contains_Tocopheryl Acetate', 'contains_Geraniol'

In [220]:
import seaborn as sns
import matplotlib.pyplot as plt

# Code generated by ChatGPT

# List of boolean features to plot
boolean_features = [
    'limited_edition', 'new', 'online_only', 'contains_Phenoxyethanol', 
    'contains_Tocopherol', 'contains_Limonene', 'contains_Linalool', 
    'contains_Silica', 'contains_Butylene Glycol', 'contains_Water', 
    'contains_Caprylic/Capric Triglyceride', 'contains_Citronellol', 
    'contains_Tocopheryl Acetate', 'contains_Geraniol'
]

# Set up the plot style and size
sns.set(style="whitegrid")
num_features = len(boolean_features)
num_cols = 2  # Number of columns for the plots
num_rows = (num_features + 1) // num_cols  # Calculate required rows

plt.figure(figsize=(15, num_rows * 5))  # Adjust figure size

# Loop through each boolean feature and create a violin plot
for i, feature in enumerate(boolean_features, 1):
    plt.subplot(num_rows, num_cols, i)
    sns.violinplot(x=feature, y='rating', data=product_df, hue=feature, palette="muted", legend=False)
    plt.title(f"Rating Distribution by {feature}")
    plt.xlabel(feature)
    plt.ylabel("Rating")

# Adjust layout for better display
plt.tight_layout()
plt.show()
Price v.s. Rating¶

To enhance clarity and organization, we use a hexbin plot instead of a scatter plot.

In [221]:
# Hexbin Plot (Rating vs Price)
# Code generated by ChatGPT

# Filter products with price <= 350 USD
filtered_df = product_df[product_df['price_usd'] <= 350]

plt.figure(figsize=(8, 6))
plt.hexbin(filtered_df['price_usd'], filtered_df['rating'], gridsize=30, 
           cmap='viridis', mincnt=1)

plt.colorbar(label='Number of Products')
plt.xlabel('Price (USD)')
plt.ylabel('Rating')
plt.title('Hexbin Plot of Rating vs Price (<= 350 USD)')
plt.show()
In [222]:
# Hexbin Plot (Rating vs Unit Price)
# Filter products with unit price <= 600 USD
filtered_df = product_df[product_df['price_per_oz'] <= 600]

plt.figure(figsize=(8, 6))
plt.hexbin(filtered_df['price_per_oz'], filtered_df['rating'], gridsize=30, 
           cmap='viridis', mincnt=1)

plt.colorbar(label='Number of Products')
plt.xlabel('Price (USD)')
plt.ylabel('Rating')
plt.title('Hexbin Plot of Rating vs Price (<= 350 USD)')
plt.show()
In [223]:
# Code generated by ChatGPT with my prompt
bins = [0, 10, 20, 50, 100, 200, 500, product_df['price_usd'].max()]
labels = ['0-10', '10-20', '20-50', '50-100', '100-200', '200-500', f'500+']

product_df['price_bin'] = pd.cut(product_df['price_usd'], bins=bins, labels=labels, include_lowest=True)

avg_rating_by_bin = product_df.groupby('price_bin')['rating'].mean().reset_index()

plt.figure(figsize=(6, 4))
plt.bar(avg_rating_by_bin['price_bin'], avg_rating_by_bin['rating'])
plt.title('Average Rating by Price Bin', fontsize=16)
plt.xlabel('Price Range (USD)', fontsize=12)
plt.ylabel('Average Rating', fontsize=12)

plt.show()
C:\Users\86151\AppData\Local\Temp\ipykernel_28580\3981940082.py:7: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  avg_rating_by_bin = product_df.groupby('price_bin')['rating'].mean().reset_index()
In [224]:
# Code generated by ChatGPT with my prompt
bins = [0, 50, 100, 200, 500, product_df['price_per_oz'].max()]
labels = ['0-50', '50-100', '100-200', '200-500', f'500+']

product_df['price_bin'] = pd.cut(product_df['price_per_oz'], bins=bins, labels=labels, include_lowest=True)

avg_rating_by_bin = product_df.groupby('price_bin')['rating'].mean().reset_index()

plt.figure(figsize=(6, 4))
plt.bar(avg_rating_by_bin['price_bin'], avg_rating_by_bin['rating'])
plt.title('Average Rating by Price Bin', fontsize=16)
plt.xlabel('Price Range (USD)', fontsize=12)
plt.ylabel('Average Rating', fontsize=12)

plt.show()
C:\Users\86151\AppData\Local\Temp\ipykernel_28580\63299085.py:7: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  avg_rating_by_bin = product_df.groupby('price_bin')['rating'].mean().reset_index()

Analysis:

From the figure above, we can observe slight differences between total price and average rating. However, no noticeable differences are found between unit price and average rating.

Additionally, if we aim to achieve higher ratings, setting the price within the 20-200 range may be effective.

Comparative Analysis¶

  • Objective: Compare ratings across different product categories or other subsets to explore variations.
  • Method: Compare ratings by category to identify the performance of specific groups.
  • Expected Outcome: Identify categories with notably higher or lower ratings, explaining which factors might drive ratings up or down.
Category v.s. Rating¶
In [225]:
# Primary Category vs Average Rating
product_df['rating'] = pd.to_numeric(product_df['rating'], errors='coerce')

avg_rating_by_category = product_df.groupby('primary_category')['rating'].mean().\
    reset_index().sort_values(by='rating', ascending=False)
avg_rating_by_category.columns = ['primary_category', 'average_rating']
print(avg_rating_by_category)
  primary_category  average_rating
2            Gifts        4.563450
5              Men        4.504992
8  Tools & Brushes        4.271458
1        Fragrance        4.230889
7         Skincare        4.228890
3             Hair        4.201113
0      Bath & Body        4.195015
4           Makeup        4.146845
6        Mini Size        4.005665
In [226]:
# Secondary Category vs Average Rating
avg_rating_by_category = product_df.groupby('secondary_category')['rating'].mean().\
    reset_index().sort_values(by='rating', ascending=False)
avg_rating_by_category.columns = ['secondary_category', 'average_rating']
print(avg_rating_by_category)
           secondary_category  average_rating
29                Other Needs        4.682980
32                    Shaving        4.641243
25                        Men        4.548450
23            Makeup Palettes        4.513805
18                 Hair Tools        4.496691
4          Beauty Supplements        4.421550
1                 Bath & Body        4.377929
34                   Skincare        4.353302
10                      Cheek        4.323424
3          Beauty Accessories        4.322500
11                  Cleansers        4.314373
37                 Treatments        4.306111
8       Brushes & Applicators        4.305875
27               Moisturizers        4.277424
24                      Masks        4.259330
17  Hair Styling & Treatments        4.244675
7           Body Moisturizers        4.237487
40                      Women        4.233255
31      Shampoo & Conditioner        4.214010
20                        Lip        4.209429
2               Bath & Shower        4.196601
6                   Body Care        4.164788
21     Lip Balms & Treatments        4.149557
38          Value & Gift Sets        4.144749
19            High Tech Tools        4.142364
14                       Face        4.139327
35                  Sunscreen        4.125169
39                   Wellness        4.123166
13                   Eye Care        4.111931
36                      Tools        4.066703
26                  Mini Size        4.066505
0                 Accessories        4.052295
12                        Eye        4.050572
5                Beauty Tools        4.033523
9       Candles & Home Scents        4.021335
30               Self Tanners        3.944451
22                     Makeup        3.926427
16                       Hair        3.898375
33            Shop by Concern        3.895000
28                       Nail        3.878163
15                  Fragrance        3.865240
In [227]:
# Secondary Category vs Average Rating
avg_rating_by_category = product_df.groupby('tertiary_category')['rating'].mean().\
    reset_index().sort_values(by='rating', ascending=False)
avg_rating_by_category.columns = ['tertiary_category', 'average_rating']
print(avg_rating_by_category)
              tertiary_category  average_rating
82                  Lip Brushes        4.711833
108                     Shaving        4.709130
91   Makeup Bags & Travel Cases        4.690660
12                Body Products        4.682980
54                    Face Wash        4.624411
..                          ...             ...
32   Deodorant & Antiperspirant        3.764542
5      Bath Soaks & Bubble Bath        3.705443
30                 Damaged Hair        3.696500
3                 BB & CC Cream        3.669130
93    Manicure & Pedicure Tools        2.285700

[118 rows x 2 columns]

Analysis:

From the results, we can observe the following trends:

Primary Category:

  • Top performers: "Gifts" (4.56) and "Men" (4.50) have the highest average ratings, indicating that these product categories are highly appreciated by customers. This may suggest that these products meet customer expectations well, potentially due to their quality, usability, or alignment with specific customer needs.
  • Lower performers: "Mini Size" (4.00) and "Makeup" (4.14) receive comparatively lower ratings, suggesting room for improvement. It’s possible that mini-sized products offer less value or that makeup items may not fully meet customer expectations in terms of variety, durability, or performance.

Secondary Category:

  • Highest rated: "Other Needs" (4.68) and "Shaving" (4.64) receive the highest scores, indicating strong customer satisfaction. Shaving products might perform well due to their practical utility, while "Other Needs" likely covers niche items that satisfy specific demands.
  • Lower ratings: Categories like "Nail" (3.88) and "Fragrance" (3.87) are rated lower, possibly due to limited product effectiveness, insufficient variety, or unmet customer expectations. The subjective nature of fragrances and varying preferences in nail care products could contribute to these lower ratings.

Tertiary Category:

  • Top categories: "Lip Brushes" (4.71) and "Shaving" (4.71) are among the highest-rated, suggesting these niche products perform well. This could imply that customers appreciate specialized tools that enhance their beauty routines or provide value through functionality.
  • Lowest performers: "Manicure & Pedicure Tools" (2.29) and "BB & CC Cream" (3.67) have the lowest average ratings, indicating dissatisfaction or unmet expectations. This could result from poor product quality, lack of innovation, or insufficient effectiveness compared to competitors.

These results highlight areas where certain products excel and others fall short, suggesting that focusing on customer feedback and product refinement in the lower-performing categories could enhance overall satisfaction.

Note: GPT facilitate writing.

Brand vs Average Rating¶
In [228]:
# brand_name vs Average Rating
avg_rating_by_category = product_df.groupby('brand_name')['rating'].mean().\
    reset_index().sort_values(by='rating', ascending=False)
avg_rating_by_category.columns = ['brand_name', 'average_rating']
print(avg_rating_by_category)
           brand_name  average_rating
78        Erno Laszlo        5.000000
10              Aquis        4.904800
158   MACRENE actives        4.889420
161              MARA        4.823860
32             CANOPY        4.813733
..                ...             ...
47   Christophe Robin        3.154767
101    Good Dye Young        3.062050
194           Overose        2.848450
253         The Maker             NaN
299        philosophy             NaN

[304 rows x 2 columns]
In [229]:
low_perform_brand = avg_rating_by_category['brand_name'].tail(35)

Analysis:

Top Brands:

  • Erno Laszlo (5.00), Aquis (4.90), MACRENE actives (4.89), and MARA (4.82) show high customer satisfaction, likely due to consistent quality and meeting customer needs.

Lower-Rated Brands:

  • Christophe Robin (3.15) and Good Dye Young (3.06) have lower ratings, suggesting issues with product quality or unmet expectations.

Lowest-Rated Brand:

  • Overose (2.85) indicates significant dissatisfaction, likely due to performance issues or poor customer experience.

High-performing brands should maintain their quality, while underperforming brands need to address product and customer experience issues.

Note: GPT facilitate writing.

Rating v.s. Online or not¶
In [230]:
# Online vs Average Rating
avg_rating_by_category = product_df.groupby('online_only')['rating'].mean().\
    reset_index().sort_values(by='rating', ascending=False)
avg_rating_by_category.columns = ['online_only', 'average_rating']
print(avg_rating_by_category)
   online_only  average_rating
0            0        4.200374
1            1        4.172408

Non-online products (4.20) have slightly higher ratings than online-only products (4.17).

Multivariate Analysis¶

  • Objective: Study the influence of multiple factors on ratings simultaneously.
  • Method: Use pivot tables to explore the relationship between online or not, categories, and ratings.
  • Expected Outcome: Understand how various factors interact to drive ratings higher or lower.
Rating v.s. Online or not & Category¶
In [231]:
# Code generated by ChatGPT with my prompt
# Calculate average rating by both primary category and online_only status
avg_rating_online_category = product_df.groupby(['primary_category', 'online_only'])['rating'].mean().reset_index()

# Pivot the data for better comparison
pivot_table = avg_rating_online_category.pivot(index='primary_category', columns='online_only', values='rating')

pivot_table.columns = ['Non-Online', 'Online']

# Calculate the rating difference between online and non-online categories
pivot_table['Rating_Difference'] = pivot_table['Online'] - pivot_table['Non-Online']

pivot_table = pivot_table.sort_values(by='Rating_Difference', ascending=False)
print(pivot_table)
                  Non-Online    Online  Rating_Difference
primary_category                                         
Mini Size           3.998237  4.058100           0.059863
Skincare            4.218118  4.271438           0.053320
Tools & Brushes     4.271543  4.271013          -0.000530
Hair                4.205973  4.191046          -0.014927
Bath & Body         4.222076  4.151288          -0.070788
Fragrance           4.259239  4.144556          -0.114683
Men                 4.542989  4.382857          -0.160132
Makeup              4.164505  3.998730          -0.165774
Gifts               4.563450       NaN                NaN

Analysis:

Primary Category Comparison:

  1. Higher Online Ratings:

    • Mini Size: 4.06 vs. 3.99 (+0.06) – More convenient for online purchases.
    • Skincare: 4.27 vs. 4.22 (+0.05) – Benefits from detailed product information online.
  2. Minimal Difference:

    • Tools & Brushes: 4.27 vs. 4.27 (-0.0005) – Similar satisfaction across both channels.
  3. Higher Non-Online Ratings:

    • Makeup: 4.16 vs. 3.99 (-0.17) – In-person trials likely preferred.
    • Fragrance: 4.26 vs. 4.14 (-0.11) – Harder to evaluate online.
    • Men: 4.54 vs. 4.38 (-0.16) – Benefits from in-store accessibility and recommendations.

Conclusion:

  • Products like Mini Size and Skincare benefit from online availability, likely due to convenience and access to reviews.
  • Makeup and Fragrances perform better offline, suggesting that in-store experiences play a key role in customer satisfaction for these categories.
  • Distribution Strategy Implication: Focusing on enhancing online shopping experiences for Makeup and Fragrances (e.g., offering virtual try-ons) could help improve ratings.

Note: GPT facilitate writing.

In [232]:
# Calculate average rating by both secondary_category and online_only status
avg_rating_online_category = product_df.groupby(['secondary_category', 'online_only'])['rating'].mean().reset_index()

# Pivot the data for better comparison
pivot_table = avg_rating_online_category.pivot(index='secondary_category', columns='online_only', values='rating')

pivot_table.columns = ['Non-Online', 'Online']

# Calculate the rating difference between online and non-online categories
pivot_table['Rating_Difference'] = pivot_table['Online'] - pivot_table['Non-Online']

pivot_table = pivot_table.sort_values(by='Rating_Difference', ascending=False)
print(pivot_table)
                           Non-Online    Online  Rating_Difference
secondary_category                                                
Beauty Tools                 3.993805  4.867600           0.873795
Bath & Body                  4.185900  4.633967           0.448067
Fragrance                    3.772564  4.120100           0.347536
Sunscreen                    4.091897  4.282774           0.190877
Body Care                    4.059471  4.240400           0.180929
Wellness                     4.081017  4.214208           0.133191
Moisturizers                 4.252550  4.374208           0.121658
Masks                        4.243141  4.364555           0.121413
Self Tanners                 3.920877  4.019887           0.099010
Cleansers                    4.300014  4.392433           0.092419
Value & Gift Sets            4.118418  4.204760           0.086341
Eye Care                     4.103188  4.155081           0.051892
Tools                        4.050158  4.087829           0.037671
Skincare                     4.347367  4.384560           0.037193
Hair Styling & Treatments    4.236636  4.266542           0.029906
Other Needs                  4.671633  4.700000           0.028367
Lip Balms & Treatments       4.144431  4.168485           0.024053
Mini Size                    4.066819  4.065684          -0.001135
Hair                         3.898549  3.897257          -0.001292
Treatments                   4.307763  4.298656          -0.009107
High Tech Tools              4.147954  4.137593          -0.010362
Cheek                        4.327744  4.295345          -0.032399
Shampoo & Conditioner        4.226891  4.184653          -0.042237
Women                        4.252738  4.174715          -0.078023
Face                         4.147996  4.067131          -0.080864
Hair Tools                   4.512100  4.427350          -0.084750
Body Moisturizers            4.277604  4.162533          -0.115071
Eye                          4.062154  3.898110          -0.164044
Candles & Home Scents        4.079350  3.907596          -0.171754
Nail                         3.946375  3.769025          -0.177350
Makeup                       3.940203  3.754230          -0.185973
Makeup Palettes              4.561207  4.371600          -0.189607
Bath & Shower                4.272463  4.067637          -0.204826
Lip                          4.236904  4.008690          -0.228214
Shaving                      4.677833  4.421700          -0.256133
Men                          4.576584  4.290915          -0.285669
Accessories                  4.124686  3.726538          -0.398149
Beauty Accessories           4.700000  4.228125          -0.471875
Brushes & Applicators        4.347053  3.800300          -0.546753
Beauty Supplements           4.421550       NaN                NaN
Shop by Concern              3.895000       NaN                NaN

Analysis:

Secondary Category Comparison:

  1. Higher Online Ratings:

    • Beauty Tools: 4.87 vs. 3.99 (+0.87) – Likely benefits from detailed reviews and product descriptions online.
    • Bath & Body: 4.63 vs. 4.19 (+0.45) – Products may appeal due to convenience and extensive options available online.
    • Fragrance: 4.12 vs. 3.77 (+0.35) – Increased online rating suggests success with detailed scent profiles and customer reviews online.
  2. Minimal Difference:

    • Skincare: 4.38 vs. 4.35 (+0.03) – Similar satisfaction across both channels, likely due to consistent product quality.
    • Hair Styling & Treatments: 4.27 vs. 4.24 (+0.03) – Indicates uniform customer perception both online and offline.
  3. Higher Non-Online Ratings:

    • Makeup: 3.94 vs. 3.75 (-0.19) – In-person product trials are likely preferred by customers.
    • Men: 4.58 vs. 4.29 (-0.29) – In-store experiences may better cater to the needs of this category.
    • Accessories: 4.12 vs. 3.73 (-0.40) – Physical presence seems to enhance the appeal, possibly due to tactile feedback.

Conclusion:

  • Categories like Beauty Tools and Bath & Body perform better online, indicating these products benefit from online descriptions and ease of access.
  • Makeup, Men’s products, and Accessories see better ratings offline, suggesting that in-store experiences enhance customer satisfaction.
  • Implication for Strategy: For categories like Makeup and Fragrances, enhancing online experiences (e.g., virtual try-ons, augmented scent descriptions) could bridge the satisfaction gap, while leveraging in-store strengths for tactile or try-before-you-buy items.

Note: GPT facilitate writing.

In [233]:
# Calculate average rating by both primary category and online_only status
avg_rating_online_category = product_df.groupby(['tertiary_category', 'online_only'])['rating'].mean().reset_index()

# Pivot the data for better comparison
pivot_table = avg_rating_online_category.pivot(index='tertiary_category', columns='online_only', values='rating')

# Rename columns for clarity
pivot_table.columns = ['Non-Online', 'Online']

# Calculate the rating difference between online and non-online categories
pivot_table['Rating_Difference'] = pivot_table['Online'] - pivot_table['Non-Online']

# Drop rows with NaN values in 'Rating_Difference' or either 'Online'/'Non-Online' columns
pivot_table = pivot_table.dropna()

# Sort by rating difference and display the top and bottom rows
pivot_table = pivot_table.sort_values(by='Rating_Difference', ascending=False)
print("pivot_table top 10:")
print(pivot_table.head(10))
print("pivot_table last 10:")
print(pivot_table.tail(10))
pivot_table top 10:
                           Non-Online    Online  Rating_Difference
tertiary_category                                                 
Teeth Whitening              3.239400  4.462040           1.222640
Blotting Papers              3.823067  5.000000           1.176933
Cologne Gift Sets            4.187500  5.000000           0.812500
Curling Irons                3.864867  4.568389           0.703522
Hair Thinning & Hair Loss    3.917200  4.544000           0.626800
Tinted Moisturizer           4.012629  4.418500           0.405871
Sheet Masks                  4.080966  4.486100           0.405134
Eye Brushes                  4.476147  4.857100           0.380953
Concealer                    4.205592  4.541175           0.335583
Eye Masks                    3.991108  4.307500           0.316392
pivot_table last 10:
                            Non-Online    Online  Rating_Difference
tertiary_category                                                  
Brush Cleaners                4.390150  3.818067          -0.572083
Eyelash Curlers               4.001762  3.379800          -0.621962
Face Sets                     4.306533  3.653525          -0.653008
Hair Dye & Root Touch-Ups     4.311633  3.491200          -0.820433
False Eyelashes               4.092528  3.270025          -0.822503
Lip Gloss                     4.240900  3.411933          -0.828967
Sponges & Applicators         4.198200  3.275133          -0.923067
Bath Soaks & Bubble Bath      4.066560  2.802650          -1.263910
Makeup Bags & Travel Cases    4.946650  3.666700          -1.279950
Lip Plumper                   4.066493  1.000000          -3.066493

Analysis:

Tertiary Category Comparison:

  1. Higher Online Ratings:

    • Teeth Whitening: 4.46 vs. 3.24 (+1.22) – Likely benefits from detailed product information and customer reviews online.
    • Blotting Papers: 5.00 vs. 3.82 (+1.18) – Convenient for online purchases, where effectiveness can be easily conveyed.
    • Cologne Gift Sets: 5.00 vs. 4.19 (+0.81) – Online appeal likely enhanced by gift options and customer testimonials.
  2. Minimal Difference:

    • Eye Masks: 4.31 vs. 3.99 (+0.32) – Consistent quality perceived both online and offline.
    • Concealer: 4.54 vs. 4.21 (+0.33) – Reliable performance across channels, possibly due to clear expectations set online.
  3. Higher Non-Online Ratings:

    • Brush Cleaners: 3.82 vs. 4.39 (-0.57) – Customers may prefer seeing product demonstrations in-store.
    • Lip Gloss: 3.41 vs. 4.24 (-0.83) – In-store trials likely enhance appeal for color and texture evaluation.
    • Lip Plumper: 1.00 vs. 4.07 (-3.07) – Significantly lower online ratings, suggesting a need for tactile experience or more accurate product descriptions online.

Conclusion:

  • Categories like Teeth Whitening and Blotting Papers perform better online, benefitting from detailed information and ease of access.
  • Brush Cleaners and Lip Gloss are rated higher offline, indicating a preference for in-store interaction and product testing.
  • Implications: Enhancing product descriptions and offering virtual trials for color cosmetics (e.g., lip gloss) online may improve ratings, while products like Lip Plumper may require improved online transparency to match customer expectations.

Note: GPT facilitate writing.

Rating v.s. Online or not & Brand¶
In [234]:
# Calculate average rating by both primary category and online_only status
avg_rating_online_category = product_df.groupby(['brand_name', 'online_only'])['rating'].mean().reset_index()

# Pivot the data for better comparison
pivot_table = avg_rating_online_category.pivot(index='brand_name', columns='online_only', values='rating')

# Rename columns for clarity
pivot_table.columns = ['Non-Online', 'Online']

# Calculate the rating difference between online and non-online categories
pivot_table['Rating_Difference'] = pivot_table['Online'] - pivot_table['Non-Online']

# Drop rows with NaN values in 'Rating_Difference' or either 'Online'/'Non-Online' columns
pivot_table = pivot_table.dropna()

# Sort by rating difference and display the top and bottom rows
pivot_table = pivot_table.sort_values(by='Rating_Difference', ascending=False)
print("pivot_table top 10:")
print(pivot_table.head(10))
print("pivot_table last 10:")
print(pivot_table.tail(10))
pivot_table top 10:
                  Non-Online    Online  Rating_Difference
brand_name                                               
Christophe Robin    2.760600  3.943100           1.182500
ROSE Ingleton MD    3.595200  4.667067           1.071867
TWEEZERMAN          3.863853  4.800000           0.936147
CLEAN RESERVE       4.162917  5.000000           0.837083
BURBERRY            4.416850  5.000000           0.583150
COOLA               3.715157  4.257425           0.542268
HUM Nutrition       4.194833  4.713640           0.518807
Jo Malone London    3.859558  4.355029           0.495471
GLO Science         3.998700  4.462040           0.463340
Josie Maran         4.298509  4.751800           0.453291
pivot_table last 10:
               Non-Online    Online  Rating_Difference
brand_name                                            
Overose          3.363600  2.333300          -1.030300
innisfree        4.316241  3.250000          -1.066241
Valentino        4.452656  3.333350          -1.119306
Floral Street    4.237330  3.099044          -1.138286
Kaja             4.287616  3.111100          -1.176516
Algenist         4.184797  3.000000          -1.184797
Urban Decay      4.185637  2.900000          -1.285637
Lilly Lashes     4.080507  2.694450          -1.386057
MILK MAKEUP      4.030039  2.416650          -1.613389
Montblanc        4.719743  2.666700          -2.053043

Analysis:

Brand Comparison:

  1. Higher Online Ratings:

    • Christophe Robin: 3.94 (Online) vs. 2.76 (Non-Online), +1.18 – Likely benefits from detailed product descriptions and reviews that address customer concerns online.
    • ROSE Ingleton MD: 4.67 vs. 3.60, +1.07 – Possibly more appealing online due to specialized skincare details and customer testimonials.
    • TWEEZERMAN: 4.80 vs. 3.86, +0.94 – Utility-focused products with clear online value through descriptions and user feedback.
  2. Minimal Difference:

    • Josie Maran: 4.75 (Online) vs. 4.30 (Non-Online), +0.45 – Consistent performance across both channels, likely due to well-understood product effectiveness.
    • GLO Science: 4.46 vs. 4.00, +0.46 – Online reviews may slightly boost appeal, but remains stable in both settings.
  3. Higher Non-Online Ratings:

    • Montblanc: 2.67 (Online) vs. 4.72 (Non-Online), -2.05 – Significantly higher offline, likely due to the experiential nature of luxury fragrance.
    • MILK MAKEUP: 2.42 vs. 4.03, -1.61 – In-person testing of makeup products may better satisfy customer expectations.
    • Lilly Lashes: 2.69 vs. 4.08, -1.39 – Benefit from in-store visibility and direct application trials.

Conclusion & Implications: Whether to focus on online or offline sales should depend on brand positioning and the nature of the primary products sold. Enhancing online presentations for brands with lower online ratings—such as offering fragrance sampling options or virtual try-ons for cosmetics—could help close the satisfaction gap.

Note: GPT facilitate writing.

Synthesis¶

Based on the descriptive, inferential, and graphical analyses, we can summarize the following key findings:

  1. Limited Predictive Power of Price, Size, and Reviews:
    While price_usd, size_oz, and reviews have a weak positive relationship with ratings, the low R-squared value suggests these factors alone are insufficient to strongly predict ratings. Additional features may be needed for more robust prediction.

  2. Significant Differences Based on Boolean Features:
    There are significant differences in average product ratings based on whether certain attributes are true or false, including limited_edition, new, online_only, and specific ingredients like contains_Phenoxyethanol, contains_Tocopherol, contains_Limonene, contains_Linalool, contains_Silica, contains_Butylene Glycol, contains_Water, contains_Caprylic/Capric Triglyceride, contains_Citronellol, contains_Tocopheryl Acetate, and contains_Geraniol.

  3. Variation in Ratings Across Brands and Categories:
    Average ratings vary widely across brands and categories, with Gifts and Men’s products as top-performing categories, likely due to well-met customer needs, while Mini Size and Makeup receive lower ratings, possibly due to value concerns or unmet expectations. Among brands, Erno Laszlo and Aquis stand out with high ratings for quality, while Christophe Robin and Good Dye Young show lower ratings, suggesting quality or expectation gaps that may impact satisfaction.

  4. Impact of Online Only vs. Offline Availability:
    For different brands and product categories, average ratings also differ based on whether the product is online-only. Categories such as Beauty Tools and Teeth Whitening perform better online, where product reviews and detailed descriptions enhance customer confidence. In contrast, categories like Fragrance and Makeup see higher ratings offline, as customers prefer to experience the scent or test colors and textures in person. This suggests that distribution strategies can significantly impact customer satisfaction, with certain categories benefiting from a strong online presence while others may perform better with in-store availability.

Note: GPT facilitate writing. The advantage of combined dataset will be explained in the synthesis section in research problem 2 & 3 & 4.

Research Problem 2: How does customer sentiment change over time?¶

Let's first recall the combined dataset again:

In [235]:
Sephora_df.sample(5)
Out[235]:
author_id rating is_recommended helpfulness total_feedback_count total_neg_feedback_count total_pos_feedback_count submission_time review_text review_title skin_tone eye_color skin_type hair_color product_id product_name brand_name price_usd pos_feedback_percentage neg_feedback_percentage brand_id child_count child_max_price child_min_price contains_Butylene Glycol contains_Caprylic/Capric Triglyceride contains_Caprylyl Glycol contains_Citric Acid contains_Citronellol contains_Dimethicone contains_Ethylhexylglycerin contains_Geraniol contains_Glycerin contains_Limonene contains_Linalool contains_Mica contains_Phenoxyethanol contains_Potassium Sorbate contains_Propanediol contains_Silica contains_Sodium Benzoate contains_Tocopherol contains_Tocopheryl Acetate contains_Water highlights ingredients limited_edition loves_count new online_only out_of_stock price_per_ml price_per_oz primary_category reviews sale_price_usd secondary_category sephora_exclusive size size_ml size_oz tertiary_category value_price_usd variation_desc variation_type variation_value
966458 22658221971 1.0 0.0 0.615385 13.0 5.0 8.0 2022-05-09 I was sent this complimentary from Peace Out S... Smells SO Awful! fairLight green combination brown P480629 Dark Spots Serum Peace Out 29.0 0.615385 0.384615 6209 0 29.0 29.0 False False True False False False False False True False False False False False True True False False False False ['Vegan', 'Good for: Dullness/Uneven Texture',... NaN 0 7931 0 0 0 0.966667 29.000000 Skincare 252.0 29.0 Treatments 1 1 oz/ 30 mL 30.0 1.0 Face Serums 29.0 No variation Size 1 oz/ 30 mL
84728 23646034852 5.0 1.0 0.000000 0.0 0.0 0.0 2019-08-12 This peel made my skin feel refreshed. I was... Great Peel fair brown combination blonde P269122 Alpha Beta Extra Strength Daily Peel Pads Dr. Dennis Gross Skincare 92.0 1.000000 0.000000 5668 2 153.0 20.0 False False False True False False False False True False False False True False False False True False True False ['Good for: Dullness/Uneven Texture', 'Clean a... NaN 0 234295 0 0 0 0.950000 26.470588 Skincare 7412.0 92.0 Treatments 1 30 Treatments + 5 Bonus 50.0 1.7 Facial Peels 102.0 No variation Size 30 Treatments + 5 Bonus
910579 5269466174 5.0 1.0 0.000000 0.0 0.0 0.0 2022-06-15 I absolutely love Tula products so when I foun... Buy it! mediumTan brown combination black P475180 On The Go Best Sellers Travel Kit TULA Skincare 49.0 1.000000 0.000000 6373 0 49.0 49.0 True True True True True True True False True True False False True True True True True False True False ['Good for: Dullness/Uneven Texture', 'Hydrati... NaN 0 5711 0 0 0 0.950000 26.470588 Skincare 93.0 49.0 Value & Gift Sets 0 Unknown size 50.0 1.7 Uncategorized 49.0 No variation Unknown Unknown
438308 12600690044 2.0 0.0 0.571429 14.0 6.0 8.0 2021-08-29 i like how it does not leave your skin looking... No title lightMedium brown combination black P429242 Clear Sunscreen Stick SPF 50+ Shiseido 30.0 0.571429 0.428571 5337 0 30.0 30.0 False False False False False False False False False False False False False False False True False True False False ['Without Phthalates', 'allure 2021 Best of Be... NaN 0 74272 0 0 0 0.950000 42.857143 Skincare 1529.0 30.0 Sunscreen 0 0.7 oz/ 20 g 50.0 0.7 Face Sunscreen 30.0 No variation Size 0.7 oz/ 20 g
1097391 9631991606 5.0 1.0 0.969697 33.0 1.0 32.0 2017-04-02 I’m African American and I have dark spots tha... Works perfectly for me Not specified Not specified dry Not specified P94421 Vinoperfect Radiance Dark Spot Serum Vitamin C... Caudalie 82.0 0.969697 0.030303 4171 0 82.0 82.0 True False True True False False False False True False False False False True False False False False False False ['Vegan', 'Good for: Dullness/Uneven Texture',... NaN 0 166423 0 0 0 2.733333 82.000000 Skincare 2911.0 82.0 Treatments 0 1 oz/ 30 mL 30.0 1.0 Face Serums 82.0 No variation Size 1 oz/ 30 mL

Comparative Analysis¶

in this part we extract the mean and std values of customers' rating value over time to study the overall trend.

In [236]:
# code generation assisted by copilot

# Convert submission_time to datetime
Sephora_df['submission_time'] = pd.to_datetime(Sephora_df['submission_time'])

# Set submission_time as the index
Sephora_df.set_index('submission_time', inplace=True)

# Resample by month and calculate mean and standard deviation of ratings
monthly_sentiment = Sephora_df['rating'].resample('M').agg(rating_mean=('mean'), rating_std=('std'))

# Display the result
print(monthly_sentiment)
C:\Users\86151\AppData\Local\Temp\ipykernel_28580\193236108.py:10: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  monthly_sentiment = Sephora_df['rating'].resample('M').agg(rating_mean=('mean'), rating_std=('std'))
                 rating_mean  rating_std
submission_time                         
2008-08-31          4.250000    0.967050
2008-09-30          4.478715    0.903466
2008-10-31          4.367816    1.076722
2008-11-30          4.459519    0.995055
2008-12-31          4.464143    0.965321
...                      ...         ...
2022-11-30          4.269916    1.106625
2022-12-31          4.477447    0.990538
2023-01-31          4.351689    1.109653
2023-02-28          4.208547    1.257636
2023-03-31          4.309767    1.161733

[176 rows x 2 columns]

From this result we can access the mean and std value of customers' ratings of certain time period on a monthly basis.

Graphical Analysis¶

With the help of visualization tools, we can see the general trend in the rating data of customers in time series by doing:

Overall Trends¶
In [237]:
# code generation assisted by copilot

import pandas as pd

import matplotlib.pyplot as plt

# Resample by month and calculate mean and standard deviation of ratings
monthly_sentiment = Sephora_df['rating'].resample('M').agg(rating_mean=('mean'), rating_std=('std'))

# Plot the mean and standard deviation of ratings over time
plt.figure(figsize=(12, 6))
plt.plot(monthly_sentiment.index, monthly_sentiment['rating_mean'], label='Mean Rating', color='blue')
plt.fill_between(monthly_sentiment.index, 
                 monthly_sentiment['rating_mean'] - monthly_sentiment['rating_std'], 
                 monthly_sentiment['rating_mean'] + monthly_sentiment['rating_std'], 
                 color='blue', alpha=0.2, label='Standard Deviation')
plt.title('Customer Sentiment Over Time')
plt.xlabel('Time')
plt.ylabel('Rating')
plt.legend()
plt.show()
C:\Users\86151\AppData\Local\Temp\ipykernel_28580\1542363257.py:8: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  monthly_sentiment = Sephora_df['rating'].resample('M').agg(rating_mean=('mean'), rating_std=('std'))

From the figure above we can see that the rating remains rather stable with all mean values roughly around 4.0 - 4.5, however the std values cover a rather large range, meaning that the ratings between customers vary among each other.

Category-Specific Trends¶
In [238]:
# code generation assisted by ChatGPT

# Get unique primary categories
primary_categories = Sephora_df['secondary_category'].unique()
num_categories = len(primary_categories)

# Set up a grid for subplots (adjust rows and columns based on number of categories)
rows = (num_categories + 1) // 2  # 2 columns per row
fig, axes = plt.subplots(rows, 2, figsize=(15, rows * 5))
axes = axes.flatten()  # Flatten the axes array for easy iteration

# Loop through each primary category and plot on respective subplot
for i, category in enumerate(primary_categories):
    # Filter the data for the current primary category
    category_data = Sephora_df[Sephora_df['secondary_category'] == category]
    
    # Resample by month and calculate mean and standard deviation of ratings
    monthly_sentiment = category_data['rating'].resample('M').agg(rating_mean=('mean'), rating_std=('std'))
    
    # Plot on the i-th subplot
    axes[i].plot(monthly_sentiment.index, monthly_sentiment['rating_mean'], label='Mean Rating', color='blue')
    axes[i].fill_between(monthly_sentiment.index, 
                         monthly_sentiment['rating_mean'] - monthly_sentiment['rating_std'], 
                         monthly_sentiment['rating_mean'] + monthly_sentiment['rating_std'], 
                         color='blue', alpha=0.2, label='Standard Deviation')
    axes[i].set_title(f'Customer Sentiment Over Time - {category}')
    axes[i].set_xlabel('Time')
    axes[i].set_ylabel('Rating')
    axes[i].legend()

# Hide any empty subplots if the number of categories is odd
for j in range(i + 1, len(axes)):
    axes[j].axis('off')

plt.tight_layout()
plt.show()
C:\Users\86151\AppData\Local\Temp\ipykernel_28580\4191964277.py:18: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  monthly_sentiment = category_data['rating'].resample('M').agg(rating_mean=('mean'), rating_std=('std'))

Analysis: (GPT facilitate writing)

  1. Stable Categories (e.g., Eye Care, Treatments): User ratings are consistent over time.

    • Strategy: Focus on maintaining quality and gathering feedback to sustain customer satisfaction.
  2. Declining Categories (e.g. Lip Balms & Treatments): User ratings show a downward trend, suggesting diminishing satisfaction.

    • Strategy: Investigate causes and consider product improvements or repositioning to realign with customer expectations.
  3. Fluctuating but Stabilizing Categories (e.g., Sunscreen, High Tech Tools, Self Tanners): Ratings fluctuate initially but stabilize over time.

    • Strategy: Leverage positive trends with consistent quality and targeted marketing to build loyalty.

In summary, stable categories should prioritize consistency, declining ones require revitalization, and stabilizing categories should focus on solidifying customer trust. This targeted approach helps address customer needs effectively across different product types.

Brand-Specific Trends¶
In [239]:
# code generation assisted by ChatGPT

low_perform_brand = low_perform_brand.to_list()

# Filter the dataset for only the low-performing brands
filtered_df = Sephora_df[Sephora_df['brand_name'].isin(low_perform_brand)]
primary_categories = filtered_df['brand_name'].unique()  # Get unique brand names from filtered data
num_categories = len(primary_categories)

filtered_df['brand_name'].value_counts()
Out[239]:
brand_name
Supergoop!                           19675
Isle of Paradise                      5985
TAN-LUXE                              1490
Dr. Lara Devgan Scientific Beauty     1421
Slip                                   692
NuFACE                                 495
Glossier                               386
INC.redible                            218
iluminage                              100
Name: count, dtype: int64
In [240]:
# Set up a grid for subplots (adjust rows and columns based on number of categories)
rows = (num_categories + 1) // 2  # 2 columns per row
fig, axes = plt.subplots(rows, 2, figsize=(15, rows * 5))
axes = axes.flatten()  # Flatten the axes array for easy iteration

# Loop through each low-performing brand and plot on respective subplot
for i, category in enumerate(primary_categories):
    # Filter the data for the current brand
    category_data = filtered_df[filtered_df['brand_name'] == category]
    
    # Resample by month and calculate mean and standard deviation of ratings
    monthly_sentiment = category_data['rating'].resample('M').agg(rating_mean=('mean'), rating_std=('std'))
    
    # Plot on the i-th subplot
    axes[i].plot(monthly_sentiment.index, monthly_sentiment['rating_mean'], label='Mean Rating', color='blue')
    axes[i].fill_between(monthly_sentiment.index, 
                         monthly_sentiment['rating_mean'] - monthly_sentiment['rating_std'], 
                         monthly_sentiment['rating_mean'] + monthly_sentiment['rating_std'], 
                         color='blue', alpha=0.2, label='Standard Deviation')
    axes[i].set_title(f'Customer Sentiment Over Time - {category}')
    axes[i].set_xlabel('Time')
    axes[i].set_ylabel('Rating')
    axes[i].legend()

# Hide any empty subplots if the number of categories is odd
for j in range(i + 1, len(axes)):
    axes[j].axis('off')

plt.tight_layout()
plt.show()
C:\Users\86151\AppData\Local\Temp\ipykernel_28580\3119811863.py:12: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  monthly_sentiment = category_data['rating'].resample('M').agg(rating_mean=('mean'), rating_std=('std'))

Analysis of Brand Trends: (GPT facilitate writing)

  1. Stable Brands (e.g., TAN-LUXE, Isle of Paradise): These brands show relatively stable ratings over time with minor fluctuations.

    • Strategy: Maintain quality and continue gathering customer feedback to sustain satisfaction.
  2. Declining Brands (e.g., Supergoop!, Glossier): Ratings show a downward trend, indicating decreasing customer satisfaction.

    • Strategy: Investigate potential issues such as quality changes or unmet expectations, and consider product or service improvements to regain customer trust.

Synthesis¶

GPT polished

From the analysis above we can see that:

  • Overall Trends: For ratings on all products as a whole, the overall rating is rather stable with little fluctuation.
  • Brand-Specific Trends: Certain brands, such as TAN-LUXE and Isle of Paradise, show stable ratings over time, indicating consistent customer satisfaction. In contrast, brands like Supergoop! and Glossier exhibit a downward trend, suggesting declining satisfaction that may require attention.
  • Category-Specific Trends: Categories also demonstrate distinct patterns, with some (e.g., Eye Care, Treatments) maintaining stable ratings, while others (e.g., Lip Balms & Treatments) show obvious declines. This highlights the need for tailored strategies for each category based on their trend.

Strength of the combination of the datasets:

The integration of timestamped ratings with brand and category data allows for a comprehensive analysis of rating trends over time across various product segments. Without this combined dataset, it would be challenging to observe patterns specific to brands or categories, limiting our ability to generate actionable insights based on time-related trends. This integrated approach reveals valuable insights into customer sentiment shifts and informs targeted strategies for different brands and categories.

Research Problem 3: Are certain ingredients linked to positive or negative customer experiences?¶

For this part we take rating as the criteria of customer experiences, indicating them as being positive/negative towards the certain product.

Descriptive Statistics¶

  • Objective: Use descriptive statistics to explore associations between ingredients and customer experience scores.
  • Method: Calculate the mean, variance, and standard deviation of sentiment ratings for products containing specific ingredients to gauge their effects.
  • Expected Outcome: Observe if specific ingredients are associated with higher or lower sentiment scores, offering a preliminary indication of their impact on customer experience.
In [241]:
# code generation assisted by copilot
# List of ingredients to analyze
ingredients_to_analyze = ['Glycerin', 'Phenoxyethanol', 'Caprylyl Glycol', 'Tocopherol', 'Limonene', 
                          'Ethylhexylglycerin', 'Dimethicone', 'Linalool', 'Silica', 'Butylene Glycol', 
                          'Citric Acid', 'Mica', 'Potassium Sorbate', 'Water', 'Caprylic/Capric Triglyceride', 
                          'Sodium Benzoate', 'Citronellol', 'Tocopheryl Acetate', 'Propanediol', 'Geraniol']

# Dictionary to store descriptive statistics
ingredient_stats = {}

# Calculate mean, variance, and standard deviation for each ingredient
for ingredient in ingredients_to_analyze:
    column_name = f'contains_{ingredient}'
    if column_name in Sephora_df.columns:
        ratings = Sephora_df[Sephora_df[column_name] == True]['rating']
        mean_rating = ratings.mean()
        variance_rating = ratings.var()
        std_dev_rating = ratings.std()
        
        ingredient_stats[ingredient] = {
            'mean': mean_rating,
            'variance': variance_rating,
            'std_dev': std_dev_rating
        }

# Convert the dictionary to a DataFrame for better visualization
ingredient_stats_df = pd.DataFrame(ingredient_stats).T
ingredient_stats_df.columns = ['Mean Rating', 'Variance', 'Standard Deviation']

# Display the result
print(ingredient_stats_df.sort_values('Mean Rating', ascending=False))
                              Mean Rating  Variance  Standard Deviation
Geraniol                         4.396920  1.102381            1.049943
Citronellol                      4.385853  1.151699            1.073172
Limonene                         4.374090  1.190390            1.091050
Linalool                         4.336357  1.260056            1.122522
Citric Acid                      4.312361  1.250098            1.118078
Caprylic/Capric Triglyceride     4.301079  1.354193            1.163698
Sodium Benzoate                  4.297634  1.284924            1.133545
Propanediol                      4.296871  1.301729            1.140934
Caprylyl Glycol                  4.296258  1.269184            1.126581
Glycerin                         4.295407  1.306163            1.142875
Tocopherol                       4.294736  1.320070            1.148943
Potassium Sorbate                4.294610  1.246196            1.116331
Dimethicone                      4.291479  1.262378            1.123556
Tocopheryl Acetate               4.287476  1.240941            1.113975
Butylene Glycol                  4.285907  1.288419            1.135085
Ethylhexylglycerin               4.279360  1.336498            1.156070
Silica                           4.278348  1.346423            1.160355
Phenoxyethanol                   4.261746  1.359882            1.166140
Water                            4.230556  1.431122            1.196295
Mica                             4.208789  1.521640            1.233548

Above displayed are the mean, variance and std values of sentiment ratings for products containing different ingredients.

Inferential Statistics¶

  • Objective: Test the significant relationship between specific ingredients and customer experience.
  • Method: Use chi-squared tests to assess the relationship between ingredients and sentiment scores, or apply t-tests and ANOVA to compare ratings between products with and without certain ingredients.
  • Expected Outcome: Confirm whether certain ingredients are significantly associated with differences in customer experience, supporting or refuting their influence on sentiment.
In [242]:
# code generation assisted by copilot
from scipy.stats import chi2_contingency, ttest_ind, f_oneway

# Chi-squared test for categorical data
def chi_squared_test(df, ingredient):
    contingency_table = pd.crosstab(df[ingredient], df['rating'])
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    return chi2, p

# T-test for comparing means between two groups
def t_test(df, ingredient):
    group_with = df[df[ingredient] == True]['rating']
    group_without = df[df[ingredient] == False]['rating']
    t_stat, p = ttest_ind(group_with, group_without, nan_policy='omit')
    return t_stat, p

# ANOVA for comparing means across multiple groups
def anova_test(df, ingredient):
    group_with = df[df[ingredient] == True]['rating']
    group_without = df[df[ingredient] == False]['rating']
    f_stat, p = f_oneway(group_with, group_without)
    return f_stat, p

# List of ingredients to analyze
ingredients_to_analyze = ['contains_Glycerin', 'contains_Phenoxyethanol', 'contains_Caprylyl Glycol', 
                          'contains_Tocopherol', 'contains_Limonene', 'contains_Ethylhexylglycerin', 
                          'contains_Dimethicone', 'contains_Linalool', 'contains_Silica', 
                          'contains_Butylene Glycol', 'contains_Citric Acid', 'contains_Mica', 
                          'contains_Potassium Sorbate', 'contains_Water', 'contains_Caprylic/Capric Triglyceride', 
                          'contains_Sodium Benzoate', 'contains_Citronellol', 'contains_Tocopheryl Acetate', 
                          'contains_Propanediol', 'contains_Geraniol']

# Dictionary to store test results
test_results = {}

# Perform tests for each ingredient
for ingredient in ingredients_to_analyze:
    chi2, chi2_p = chi_squared_test(Sephora_df, ingredient)
    t_stat, t_p = t_test(Sephora_df, ingredient)
    f_stat, f_p = anova_test(Sephora_df, ingredient)
    
    test_results[ingredient] = {
        'chi2_p': chi2_p,
        't_p': t_p,
        'anova_p': f_p
    }

# Convert the results to a DataFrame for better visualization
test_results_df = pd.DataFrame(test_results).T
test_results_df.columns = ['Chi-squared p-value', 'T-test p-value', 'ANOVA p-value']

# Display the results
print(test_results_df)
                                       Chi-squared p-value  T-test p-value  \
contains_Glycerin                            2.238161e-154    8.434185e-07   
contains_Phenoxyethanol                       0.000000e+00   7.232323e-204   
contains_Caprylyl Glycol                     9.270977e-214    9.466178e-02   
contains_Tocopherol                           2.357263e-11    3.627108e-03   
contains_Limonene                            7.769743e-278   5.029388e-268   
contains_Ethylhexylglycerin                   9.687556e-60    9.779087e-31   
contains_Dimethicone                         2.237796e-222    4.070112e-04   
contains_Linalool                             4.279626e-43    3.621294e-41   
contains_Silica                               5.458571e-16    1.289459e-11   
contains_Butylene Glycol                      0.000000e+00    2.069692e-22   
contains_Citric Acid                         1.015159e-174    5.422440e-17   
contains_Mica                                1.138734e-172   2.051101e-125   
contains_Potassium Sorbate                    0.000000e+00    1.737248e-02   
contains_Water                                0.000000e+00    0.000000e+00   
contains_Caprylic/Capric Triglyceride         3.457018e-99    3.146970e-01   
contains_Sodium Benzoate                      1.254051e-83    3.989864e-01   
contains_Citronellol                         1.648621e-144   8.661169e-146   
contains_Tocopheryl Acetate                   0.000000e+00    1.235899e-07   
contains_Propanediol                          6.316156e-40    1.419154e-01   
contains_Geraniol                            6.076203e-180   1.194250e-164   

                                       ANOVA p-value  
contains_Glycerin                       8.434185e-07  
contains_Phenoxyethanol                7.232323e-204  
contains_Caprylyl Glycol                9.466178e-02  
contains_Tocopherol                     3.627108e-03  
contains_Limonene                      5.029388e-268  
contains_Ethylhexylglycerin             9.779087e-31  
contains_Dimethicone                    4.070112e-04  
contains_Linalool                       3.621294e-41  
contains_Silica                         1.289459e-11  
contains_Butylene Glycol                2.069692e-22  
contains_Citric Acid                    5.422440e-17  
contains_Mica                          2.051101e-125  
contains_Potassium Sorbate              1.737248e-02  
contains_Water                          0.000000e+00  
contains_Caprylic/Capric Triglyceride   3.146970e-01  
contains_Sodium Benzoate                3.989864e-01  
contains_Citronellol                   8.661169e-146  
contains_Tocopheryl Acetate             1.235899e-07  
contains_Propanediol                    1.419154e-01  
contains_Geraniol                      1.194250e-164  

interpretation assisted by GPT-4o

Interpretation of result:

  1. Criteria for Statistical Significance:

    • Typically, a p-value less than 0.05 indicates statistical significance. Therefore, if any of the p-values for an ingredient (across Chi-squared, T-test, or ANOVA) are below 0.05, it suggests that the ingredient has a significant effect on customer sentiment or experience.
  2. Key Findings:

    • Ingredients with extremely low p-values in all three tests (Chi-squared, T-test, and ANOVA) are likely to have a substantial and consistent impact on customer experience. For example:
      • contains_Glycerin:
        • Chi-squared p-value: 2.24e-154
        • T-test p-value: 8.43e-07
        • ANOVA p-value: 8.43e-07
        • Interpretation: Glycerin is significantly associated with differences in ratings, suggesting it likely influences customer sentiment positively or negatively, depending on the type of rating data.
      • contains_Phenoxyethanol:
        • Chi-squared p-value: 0.000000
        • T-test p-value: 7.23e-204
        • ANOVA p-value: 7.23e-204
        • Interpretation: Phenoxyethanol shows a strong association with customer ratings, indicating it has a significant impact on customer experience.
      • contains_Limonene:
        • Chi-squared p-value: 7.77e-278
        • T-test p-value: 5.03e-268
        • ANOVA p-value: 5.03e-268
        • Interpretation: Limonene has highly significant p-values across all tests, suggesting it may be a crucial factor in customer experience.
  3. Ingredients with Mixed Significance:

    • Some ingredients show significance in one or two tests but not in all three. For example:
      • contains_Caprylyl Glycol:
        • Chi-squared p-value: 9.27e-214
        • T-test p-value: 0.0947 (not significant)
        • ANOVA p-value: 0.0947 (not significant)
        • Interpretation: Caprylyl Glycol is significant in the Chi-squared test, suggesting an association with rating categories, but not in the T-test or ANOVA, indicating that its effect on customer experience may be less direct or context-dependent.
      • contains_Tocopherol:
        • Chi-squared p-value: 2.36e-11
        • T-test p-value: 0.0036
        • ANOVA p-value: 0.0036
        • Interpretation: Tocopherol is significant in all three tests, though the effect size may be smaller than ingredients with much lower p-values.
  4. Ingredients with No Significant Effect:

    • Ingredients with p-values consistently above 0.05 are not significantly associated with customer experience in a measurable way. For example:
      • contains_Caprylic/Capric Triglyceride:
        • Chi-squared p-value: 3.46e-99 (significant in Chi-squared but not in T-test or ANOVA)
        • T-test p-value: 0.315
        • ANOVA p-value: 0.315
        • Interpretation: This ingredient has no significant effect on the mean customer experience ratings but may slightly influence categorical rating distributions.
  5. Summary:

    • Strong Influencers: Ingredients like Glycerin, Phenoxyethanol, and Limonene have very low p-values across all tests, indicating that they are likely associated with significant differences in customer experience. These ingredients could be primary factors influencing customer sentiment.
    • Moderate Influencers: Ingredients like Tocopherol and Dimethicone show moderate significance and may contribute to customer experience, albeit with a smaller effect size.
    • Non-Influencers: Ingredients such as Caprylic/Capric Triglyceride and Sodium Benzoate have high p-values in T-tests and ANOVA, suggesting they do not significantly affect customer experience ratings.

Conclusion

This analysis confirms that certain ingredients (e.g., Glycerin, Phenoxyethanol, Limonene) are significantly associated with differences in customer experience, either positively or negatively, supporting their influence on sentiment. However, not all ingredients have a notable impact, highlighting the importance of ingredient selection in product formulation for customer satisfaction.

Comparative Analysis¶

  • Objective: Compare sentiment scores among different ingredient combinations to explore ingredient-related differences.
  • Method: Compare sentiment ratings for products with different ingredient combinations to identify significant differences.
  • Expected Outcome: Identify ingredient combinations associated with notably positive or negative experiences, providing insights into ingredient impact on customer sentiment.
In [243]:
# code generation assisted by copilot

# Define the ingredient combinations to analyze
ingredient_combinations = [
    ['contains_Glycerin', 'contains_Phenoxyethanol'],
    ['contains_Caprylyl Glycol', 'contains_Tocopherol'],
    ['contains_Limonene', 'contains_Ethylhexylglycerin'],
    ['contains_Dimethicone', 'contains_Linalool'],
    ['contains_Silica', 'contains_Butylene Glycol']
]

# Dictionary to store the results
combination_results = {}

# Loop through each combination and perform the analysis
for combination in ingredient_combinations:
    # Filter the dataframe for products containing the specific combination of ingredients
    filtered_df = Sephora_df
    for ingredient in combination:
        filtered_df = filtered_df[filtered_df[ingredient] == True]
    
    # Calculate the mean rating for the combination
    mean_rating = filtered_df['rating'].mean()
    
    # Perform ANOVA to test for significant differences
    groups = [Sephora_df[Sephora_df[ingredient] == True]['rating'] for ingredient in combination]
    f_stat, p_value = f_oneway(*groups)
    
    # Store the results
    combination_results[tuple(combination)] = {
        'mean_rating': mean_rating,
        'f_stat': f_stat,
        'p_value': p_value
    }

# Convert the results to a DataFrame for better visualization
combination_results_df = pd.DataFrame(combination_results).T
combination_results_df.columns = ['Mean Rating', 'F-statistic', 'P-value']

# Display the results
print(combination_results_df)
                                                      Mean Rating  \
contains_Glycerin        contains_Phenoxyethanol         4.273820   
contains_Caprylyl Glycol contains_Tocopherol             4.299253   
contains_Limonene        contains_Ethylhexylglycerin     4.314862   
contains_Dimethicone     contains_Linalool               4.312045   
contains_Silica          contains_Butylene Glycol        4.356550   

                                                      F-statistic  \
contains_Glycerin        contains_Phenoxyethanol       250.416987   
contains_Caprylyl Glycol contains_Tocopherol             0.304765   
contains_Limonene        contains_Ethylhexylglycerin   933.539286   
contains_Dimethicone     contains_Linalool             142.575725   
contains_Silica          contains_Butylene Glycol        4.230261   

                                                            P-value  
contains_Glycerin        contains_Phenoxyethanol       2.133692e-56  
contains_Caprylyl Glycol contains_Tocopherol           5.809106e-01  
contains_Limonene        contains_Ethylhexylglycerin  7.490995e-205  
contains_Dimethicone     contains_Linalool             7.379258e-33  
contains_Silica          contains_Butylene Glycol      3.970979e-02  

analysis generation assisted by copilot

Analysis of Ingredient Combinations and Their Impact on Customer Sentiment

  • Objective

To identify ingredient combinations associated with notably positive or negative experiences, providing insights into their impact on customer sentiment.

  • Method

We analyzed the mean rating, F-statistic, and p-value for various ingredient combinations to determine their significance and impact on customer sentiment.

  • Results
  1. Combination: contains_Glycerin and contains_Phenoxyethanol

    • Mean Rating: 4.27
    • F-statistic: 250.42
    • P-value: 2.13e-56
    • Insight: This combination is highly significant and positively associated with customer sentiment, indicating that products containing both Glycerin and Phenoxyethanol are well-received by customers.
  2. Combination: contains_Caprylyl Glycol and contains_Tocopherol

    • Mean Rating: 4.30
    • F-statistic: 0.30
    • P-value: 0.58
    • Insight: This combination shows a high mean rating but is not statistically significant, suggesting that while customers may rate these products highly, the combination does not have a strong impact on sentiment.
  3. Combination: contains_Limonene and contains_Ethylhexylglycerin

    • Mean Rating: 4.31
    • F-statistic: 933.54
    • P-value: 7.49e-205
    • Insight: This combination is extremely significant and positively associated with customer sentiment, indicating that products containing both Limonene and Ethylhexylglycerin are highly favored by customers.
  4. Combination: contains_Dimethicone and contains_Linalool

    • Mean Rating: 4.31
    • F-statistic: 142.58
    • P-value: 7.38e-33
    • Insight: This combination is highly significant and positively associated with customer sentiment, suggesting that products containing both Dimethicone and Linalool are well-received.
  5. Combination: contains_Silica and contains_Butylene Glycol

    • Mean Rating: 4.36
    • F-statistic: 4.23
    • P-value: 0.04
    • Insight: This combination is significant and positively associated with customer sentiment, indicating that products containing both Silica and Butylene Glycol are generally well-received by customers.
  • Conclusion

The analysis reveals that certain ingredient combinations, such as contains_Glycerin and contains_Phenoxyethanol, contains_Limonene and contains_Ethylhexylglycerin, and contains_Dimethicone and contains_Linalool, are notably associated with positive customer experiences. These combinations have high mean ratings and statistically significant p-values, indicating their strong impact on customer sentiment. On the other hand, combinations like contains_Caprylyl Glycol and contains_Tocopherol show high mean ratings but lack statistical significance, suggesting a less direct impact on sentiment.

These insights can guide product formulation and marketing strategies to enhance customer satisfaction by focusing on effective ingredient combinations.

Synthesis¶

summary synthesis assisted by copilot

  • Certain ingredients (e.g., Glycerin, Phenoxyethanol, Limonene) significantly impact customer experience, either positively or negatively.
  • Not all ingredients have a notable impact, emphasizing the importance of ingredient selection in product formulation.
  • Ingredient combinations such as contains_Glycerin and contains_Phenoxyethanol, contains_Limonene and contains_Ethylhexylglycerin, and contains_Dimethicone and contains_Linalool are associated with positive customer experiences.
  • These combinations have high mean ratings and statistically significant p-values, indicating a strong impact on customer sentiment.
  • Combinations like contains_Caprylyl Glycol and contains_Tocopherol show high mean ratings but lack statistical significance, suggesting a less direct impact on sentiment.
  • Strength of the combination of the datasets: allow us to have access to the mean, var, and std information of all the skin-care products specifically. Without combination, only mean value is present in the product dataframe, while in the review dataframe no exact information about the specific skin-care category exists.

Research Problem 4: How certain ingredients link to effectiveness of customers with different skin types?¶

With the common sense that different skin types may require different kinds of cosmetics, we look into what are the preference and effectiveness differences between people with different skin types.

First we see how many types of skin are there listed in the dataset:

In [244]:
Sephora_df['skin_type'].value_counts()
Out[244]:
skin_type
combination      544513
dry              185937
normal           131910
oily             120494
Not specified    111557
Name: count, dtype: int64

To better see the number of occurrences for each type, we can apply the visualziation tool as follows.

In [245]:
# prompt to copilot: use plotting tools to visualize the number occurrences of each skin types

import seaborn as sns

import matplotlib.pyplot as plt

# Count the occurrences of each skin type
skin_type_counts = Sephora_df['skin_type'].value_counts()

# Plot the data
plt.figure(figsize=(10, 6))
sns.barplot(x=skin_type_counts.index, y=skin_type_counts.values, palette='viridis')
plt.title('Number of Occurrences of Each Skin Type')
plt.xlabel('Skin Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
C:\Users\86151\AppData\Local\Temp\ipykernel_28580\154163572.py:12: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=skin_type_counts.index, y=skin_type_counts.values, palette='viridis')

From the figure we can see that the combination type has the most number of occurrences, with all other types quite similar to each other in number.

Comparative Analysis¶

In this part we study the impact of Glycerin contain on product helpfulness across different categories of skin types.

Prepare data¶

To study the differences between two contrast types of skins, we look specifically into the differences of helpfulness towards dry and oily types of skins, for products with/without certain ingredients. First we nnsure that skin_type and contains_Glycerin in Sephora_df are categorical variables so we can perform the ANOVA analysis.

In [246]:
# code generation and interpretation assisted by ChatGPT-4o

Sephora_df['skin_type'] = Sephora_df['skin_type'].astype('category')
Sephora_df['contains_Glycerin'] = Sephora_df['contains_Glycerin'].astype('category')
Build the ANOVA Model¶

We can use a two-way ANOVA model to analyze if the mean helpfulness differs between skin_type and contains_Glycerin, and if there is an interaction effect between these two factors.

In [247]:
from statsmodels.formula.api import ols
model = ols('helpfulness ~ C(skin_type) * C(contains_Glycerin)', data=Sephora_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
                                          sum_sq         df           F  \
C(skin_type)                          299.254643        4.0  381.820830   
C(contains_Glycerin)                    6.787471        1.0   34.640704   
C(skin_type):C(contains_Glycerin)      28.539878        4.0   36.414204   
Residual                           214436.036012  1094401.0         NaN   

                                         PR(>F)  
C(skin_type)                       0.000000e+00  
C(contains_Glycerin)               3.966425e-09  
C(skin_type):C(contains_Glycerin)  1.742991e-30  
Residual                                    NaN  
Interpret Results¶

Based on the ANOVA result, we can get:

  1. Effect of skin_type

    • F-value: 381.82
    • p-value (PR(>F)): 0.000000
    • Interpretation: The p-value for C(skin_type) is 0, which is less than 0.05, indicating a statistically significant effect. This suggests that there is a significant difference in the mean helpfulness score across different skin types (Dry, Oily, etc.).
  2. Effect of contains_Glycerin

    • F-value: 34.64
    • p-value (PR(>F)): 3.97e-09
    • Interpretation: The p-value for C(contains_Glycerin) is also very small (much less than 0.05), indicating a statistically significant effect. This suggests that whether a product contains glycerin significantly affects the helpfulness score, regardless of skin type.
  3. Interaction Effect between skin_type and contains_Glycerin

    • F-value: 36.41
    • p-value (PR(>F)): 1.74e-30
    • Interpretation: The interaction term C(skin_type):C(contains_Glycerin) has a p-value close to zero, which is highly significant. This indicates that the effect of glycerin content on helpfulness varies by skin type. In other words, the impact of glycerin on perceived helpfulness is different for different skin types.
  • Summary
    • Significant main effects: Both skin_type and contains_Glycerin independently influence the mean helpfulness.
    • Significant interaction effect: The effect of glycerin on helpfulness depends on skin type.

These results suggest that when evaluating product helpfulness, it's important to consider both the user's skin type and whether the product contains glycerin, as their interaction plays a significant role.

Graphical Analysis¶

To observe the mean differences across groups more clearly, we can use seaborn to create a grouped box plot:

In [248]:
sns.boxplot(x='contains_Glycerin', y='helpfulness', hue='skin_type', data=Sephora_df)
plt.title('Comparison of Helpfulness by Skin Type and Glycerin Content')
plt.show()

Synthesis¶

same as above in the summary part

  • Summary
    • Significant main effects: Both skin_type and contains_Glycerin independently influence the mean helpfulness.
    • Significant interaction effect: The effect of glycerin on helpfulness depends on skin type.

These results suggest that when evaluating product helpfulness, it's important to consider both the user's skin type and whether the product contains glycerin, as their interaction plays a significant role.

  • Strength of the combination of the datasets: by combining two datasets we can at the same time have field information from both datasets, including skin_type and helpfulness comment from review dataset and ingredient information from product dataset. This makes it possible for us to analyze how/whether certain ingredient contribute to customers' feeling with the products.

Reflection¶

In the process of analyzing this dataset, we gained valuable insights into handling various data types, dealing with missing values, and selecting appropriate analytical and visualization methods. Each step, from initial data understanding to more targeted analyses, required careful consideration to extract meaningful insights aligned with our research objectives.

Handling missing values was one of the initial challenges. We approached this by assessing the context and importance of each variable. For numerical data, missing values were typically imputed with the mean or median when appropriate, allowing for continuity in statistical analyses without distorting data distributions. For categorical and boolean data, missing values were either filled with the mode or retained as a distinct category if they represented meaningful absence (e.g., indicating the non-availability of a feature). This strategy ensured that missing values didn’t bias the analysis while still retaining as much information as possible.

The dataset contained a variety of data types, each requiring distinct handling and analysis techniques. Numerical data, such as product prices and sizes, was analyzed using descriptive statistics to understand distributions and general trends. For categorical data like product categories and brands, we employed grouping and pivot tables to uncover rating trends across different classes. Boolean variables, which indicated the presence or absence of certain features, were analyzed using ANOVA tests and visualized with violin plots to capture differences in ratings. Text data are processed using advanced techniques like sentiment analysis or text categorization.

Selecting the right analytical and visualization methods was essential to effectively interpret the data. We began with exploratory data analysis to understand the basic characteristics and patterns, for example using histograms for numerical variables. When examining relationships between variables relevant to the research questions, we used scatter plots for continuous relationships and hexbin plots for dense data, as they provided clearer insights than traditional scatter plots. For categorical comparisons, violin plots and box plots were highly effective in displaying distributional differences across categories.

The process involved several stages, starting with understanding the data and conducting basic exploratory analyses to get an overview of trends. We then moved to more specific analyses based on the research questions, focusing on key features that were likely to impact product ratings. This approach allowed us to build a logical flow, moving from broad insights to specific findings relevant to our objectives.

Throughout the analysis, we encountered a few challenges, particularly in selecting the most appropriate methods for visualization and handling the diverse data types. For instance, deciding on how to treat missing values without introducing bias was complex, as different variables required different approaches. The variety of data types also made it necessary to use multiple analytical techniques, which required constant adjustment and validation. Despite these challenges, we found that iterating on the analysis methods and continually aligning them with the research questions helped produce clearer and more actionable insights. This project underscored the importance of flexibility and methodological rigor when working with real-world data.

Note: GPT polished.